Core Concepts
Foundation models can achieve zero-shot performance on VQA by coordinating specialized agents within a multi-agent system.
Abstract
Introduces Multi-Agent VQA, a system for zero-shot visual question answering.
Focuses on practicality and robustness without fine-tuning on specific datasets.
Presents preliminary experimental results and failure cases.
1. Introduction:
Rapid emergence of multi-modal foundation models bridging vision and language tasks.
Zero-shot VQA capabilities remain largely unexplored; almost all pre-trained LVLMs in the literature require fine-tuning on specific VQA datasets.
2. Methods:
Adaptive Multi-Agent VQA system pipeline overview.
Utilizes GPT-4V as the LVLM and GPT-3.5 as the LLM (see the pipeline sketch after this outline).
3. Experiments:
Datasets:
Evaluation on subsets of the VQA-v2 and GQA datasets, limited by GPT-4V API constraints.
Results:
Comparison of fine-tuned vs. zero-shot models showing limitations of existing approaches.
Ablation study:
Impact of detailed chain-of-thought (CoT) reasoning, the CLIP-Count agent, and the multi-agent pipeline on performance.
Limitations:
Difficulty with object-counting tasks; reliance on external API calls slows inference.
4. Future work:
Plans to explore different foundation models and prompt engineering, and to present a comprehensive zero-shot VQA benchmark.
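A minimal sketch of how such an adaptive multi-agent pipeline could be wired together. All helper names (call_lvlm, call_llm, call_clip_count_agent, call_grounding_agent, multi_agent_vqa) are hypothetical stand-ins, not the paper's actual implementation; the sketch only assumes the roles the paper describes: GPT-4V as the LVLM, GPT-3.5 as the organizing LLM, and CLIP-Count as the counting agent.

```python
# Hypothetical sketch of an adaptive multi-agent VQA loop.
# None of these helper names come from the paper; in practice each
# would wrap an API call (GPT-4V, GPT-3.5) or a specialized vision model.

def call_lvlm(image, prompt: str) -> str:
    """Query the LVLM (e.g., GPT-4V) with an image and a text prompt."""
    raise NotImplementedError  # would wrap a GPT-4V API call

def call_llm(prompt: str) -> str:
    """Query the organizing LLM (e.g., GPT-3.5) with a text-only prompt."""
    raise NotImplementedError  # would wrap a GPT-3.5 API call

def call_clip_count_agent(image, target: str) -> int:
    """Count instances of `target` with a counting model such as CLIP-Count."""
    raise NotImplementedError

def call_grounding_agent(image, target: str):
    """Locate and crop a missed object with a grounded detection model."""
    raise NotImplementedError

def multi_agent_vqa(image, question: str) -> str:
    # 1. Ask the organizing LLM whether this is a counting question.
    route = call_llm(f"Is this a counting question? Answer yes/no: {question}")
    if route.strip().lower().startswith("yes"):
        target = call_llm(f"Name the object to count in: {question}")
        return str(call_clip_count_agent(image, target))

    # 2. Otherwise, let the LVLM attempt a direct zero-shot answer.
    answer = call_lvlm(image, question)

    # 3. If the LVLM reports a missed object, ground it and ask again
    #    on the cropped region (illustrative trigger condition only).
    if "cannot find" in answer.lower() or "not visible" in answer.lower():
        target = call_llm(f"Name the key object in: {question}")
        crop = call_grounding_agent(image, target)
        answer = call_lvlm(crop, question)

    return answer
```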
Stats
When the LVLM misses a key object in the image, a specialized agent is called in to address the problem.
When a question asks to count specific objects, the CLIP-Count agent is invoked (see the routing sketch after this list).
BEiT3-large-indomain and VLMo-large-coco achieve near-zero accuracy because they are not fine-tuned on VQA-v2.
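A lightweight illustration of the counting-question routing mentioned above. The keyword heuristic below is purely hypothetical, not taken from the paper (the system presumably makes this decision with its organizing LLM); it only shows where such a routing check would sit.

```python
import re

# Hypothetical heuristic for deciding when to invoke the CLIP-Count agent.
# A cheap keyword pattern like this is only an illustrative stand-in for
# the LLM-based routing the multi-agent system would actually perform.
_COUNT_PATTERN = re.compile(r"\bhow many\b|\bcount\b|\bnumber of\b", re.IGNORECASE)

def needs_counting_agent(question: str) -> bool:
    """Return True if the question likely asks to count objects."""
    return bool(_COUNT_PATTERN.search(question))

# Example:
# needs_counting_agent("How many dogs are in the picture?")  -> True
# needs_counting_agent("What color is the dog?")             -> False
```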
Quotes
"Almost all pre-trained large vision-language models (LVLM) in the VQA literature require fine-tuning on specified VQA datasets."
"Our study focuses on the system’s performance without fine-tuning it on specific VQA datasets, making it more practical and robust in the open world."