Core Concepts
Addressing underspecification in visual question inputs can improve zero-shot performance of large vision-language models by incorporating relevant visual details and commonsense reasoning.
Abstract
The paper introduces Rephrase, Augment and Reason (REPARE), a gradient-free framework that aims to improve zero-shot performance of large vision-language models (LVLMs) on visual question answering (VQA) tasks.
The key insights are:
- Underspecified questions, which omit relevant visual details or leave the required reasoning implicit, can lead to incorrect answers from LVLMs.
- REPARE interacts with the underlying LVLM to extract salient entities, captions, and rationales from the image. It then fuses this information into the original question to generate modified question candidates (see the sketch after this list).
- An unsupervised scoring function based on the LVLM's confidence in the generated answer is used to select the most promising question candidate.
- Experiments on VQAv2, A-OKVQA, and VizWiz datasets show that REPARE can improve zero-shot accuracy by up to 7.94% across different LVLM architectures.
- Analysis reveals that REPARE questions are more syntactically and semantically complex, indicating reduced underspecification.
- REPARE leverages the asymmetric strengths of the LVLM, allowing the powerful language model component to do more of the task while still benefiting from the image.
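
To make the pipeline concrete, here is a minimal Python sketch of a REPARE-style loop. The `LVLM` interface (`generate`, `answer_logprob`), the prompt wording, and the use of an image path as input are illustrative assumptions for this sketch, not the paper's actual API or prompts.

```python
from dataclasses import dataclass
from typing import Callable, List

# Hypothetical LVLM interface (an assumption for this sketch, not the paper's API):
#   generate(image, prompt) -> str: free-form generation conditioned on the image and text
#   answer_logprob(image, question, answer) -> float: log-probability the model assigns to `answer`
@dataclass
class LVLM:
    generate: Callable[[str, str], str]
    answer_logprob: Callable[[str, str, str], float]


def generate_candidates(lvlm: LVLM, image: str, question: str, n: int = 4) -> List[str]:
    """Extract entities, a caption, and a rationale from the image via the LVLM itself,
    then fuse them into rephrased versions of the original question."""
    candidates = [question]  # keep the original question as one of the candidates
    entities = lvlm.generate(image, f"List the salient entities relevant to: {question}")
    caption = lvlm.generate(image, "Describe the image in one sentence.")
    rationale = lvlm.generate(image, f"Give a short rationale for answering: {question}")
    for _ in range(n - 1):
        fused = lvlm.generate(
            image,
            "Rewrite the question so it explicitly mentions the relevant visual details.\n"
            f"Question: {question}\nEntities: {entities}\n"
            f"Caption: {caption}\nRationale: {rationale}",
        )
        candidates.append(fused)
    return candidates


def select_candidate(lvlm: LVLM, image: str, candidates: List[str]) -> str:
    """Unsupervised selection: pick the candidate whose generated answer the LVLM
    is most confident about (highest answer log-probability)."""
    def confidence(question: str) -> float:
        answer = lvlm.generate(image, question)
        return lvlm.answer_logprob(image, question, answer)
    return max(candidates, key=confidence)


def repare_style_answer(lvlm: LVLM, image: str, question: str) -> str:
    """Full loop: generate modified question candidates, select one, answer it."""
    best_question = select_candidate(lvlm, image, generate_candidates(lvlm, image, question))
    return lvlm.generate(image, best_question)
```

Because selection relies only on the model's confidence in its own answer, the whole procedure stays gradient-free and requires no labeled data, consistent with the zero-shot setting described above.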
Statistics
"Clocks can tell time, so read the clock to determine the time of day." (an example rationale extracted by REPARE)
"A tall, stone building with a clock tower on top on a cloudy day" (an example caption extracted by REPARE)