Incorporating question-driven image captions into a zero-shot visual question answering pipeline can enhance performance across various question types compared to using general-purpose image captions.
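As a concrete illustration, a caption-then-answer pipeline of this kind can be sketched as below. The `captioner` and `llm` callables are hypothetical placeholders for any image-captioning model and any instruction-following language model; the prompts are illustrative, not drawn from a specific system.

```python
from typing import Callable

def question_driven_vqa(
    image_path: str,
    question: str,
    captioner: Callable[[str, str], str],  # (image_path, instruction) -> caption
    llm: Callable[[str], str],             # (prompt) -> answer
) -> str:
    """Zero-shot VQA via a question-driven caption instead of a generic one."""
    # 1. Ask the captioner to describe only what is relevant to the question.
    caption = captioner(
        image_path,
        f"Describe the image, focusing on details needed to answer: {question}",
    )
    # 2. Answer from the caption alone, with no task-specific training.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer in a short phrase:"
    )
    return llm(prompt)
```

The key design choice is that the captioning instruction is conditioned on the question, so the textual context handed to the language model covers the details the question actually needs rather than whatever a generic caption happens to mention.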
Addressing underspecification in visual question inputs, by making relevant visual details and commonsense assumptions explicit, can improve the zero-shot performance of large vision-language models.
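A minimal sketch of this idea, under the same assumptions (hypothetical `captioner` and `llm` callables), first makes the missing context explicit and then answers the fully specified question:

```python
from typing import Callable

def clarify_then_answer(
    image_path: str,
    question: str,
    captioner: Callable[[str, str], str],  # (image_path, instruction) -> description
    llm: Callable[[str], str],             # (prompt) -> text
) -> str:
    """Resolve an underspecified question before answering it zero-shot."""
    # Gather the visual evidence that a terse question leaves implicit.
    details = captioner(image_path, f"List visual details relevant to: {question}")
    # Restate the question with visual context and commonsense assumptions spelled out.
    rephrased = llm(
        f"Visual details: {details}\n"
        f"Original question: {question}\n"
        "Rewrite the question so all implicit assumptions are explicit:"
    )
    # Answer the now fully specified question.
    return llm(f"Visual details: {details}\nQuestion: {rephrased}\nAnswer:")
```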
Foundation models can perform zero-shot VQA when organized as specialized agents within a multi-agent system.
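One way to picture such a system, again with hypothetical callables standing in for the underlying foundation models, is an orchestrator that collects short reports from specialized agents (for example captioning, OCR, or counting agents) and reasons over them:

```python
from typing import Callable, Dict

def multi_agent_vqa(
    image_path: str,
    question: str,
    agents: Dict[str, Callable[[str, str], str]],  # name -> (image_path, question) -> report
    orchestrator: Callable[[str], str],            # (prompt) -> final answer
) -> str:
    """Zero-shot VQA by aggregating reports from specialized agents."""
    # Each agent inspects the image with respect to the question and reports back.
    reports = [f"{name}: {agent(image_path, question)}" for name, agent in agents.items()]
    # The orchestrator reasons over all reports to produce the final answer.
    prompt = (
        "Agent reports:\n" + "\n".join(reports) + "\n"
        f"Question: {question}\n"
        "Final answer:"
    )
    return orchestrator(prompt)
```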