Incorporating question-driven image captions into a zero-shot visual question answering pipeline can enhance performance across various question types compared to using general-purpose image captions.
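As a concrete illustration, a caption-then-answer pipeline of this kind can be sketched as below. The `captioner` and `llm` callables are hypothetical placeholders for any image-captioning model and any instruction-following language model; the prompts are illustrative, not drawn from a specific system.

```python
from typing import Callable

def question_driven_vqa(
    image_path: str,
    question: str,
    captioner: Callable[[str, str], str],  # (image_path, instruction) -> caption
    llm: Callable[[str], str],             # (prompt) -> answer
) -> str:
    """Zero-shot VQA via a question-driven caption instead of a generic one."""
    # 1. Ask the captioner to describe only what is relevant to the question.
    caption = captioner(
        image_path,
        f"Describe the image, focusing on details needed to answer: {question}",
    )
    # 2. Answer from the caption alone, with no task-specific training.
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer in a short phrase:"
    )
    return llm(prompt)
```

The key design choice is that the captioning instruction is conditioned on the question, so the textual context handed to the language model covers the details the question actually needs rather than whatever a generic caption happens to mention.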
Addressing underspecification in visual question inputs, by making relevant visual details and commonsense assumptions explicit, can improve the zero-shot performance of large vision-language models.
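A minimal sketch of this idea, under the same assumptions (hypothetical `captioner` and `llm` callables), first makes the missing context explicit and then answers the fully specified question:

```python
from typing import Callable

def clarify_then_answer(
    image_path: str,
    question: str,
    captioner: Callable[[str, str], str],  # (image_path, instruction) -> description
    llm: Callable[[str], str],             # (prompt) -> text
) -> str:
    """Resolve an underspecified question before answering it zero-shot."""
    # Gather the visual evidence that a terse question leaves implicit.
    details = captioner(image_path, f"List visual details relevant to: {question}")
    # Restate the question with visual context and commonsense assumptions spelled out.
    rephrased = llm(
        f"Visual details: {details}\n"
        f"Original question: {question}\n"
        "Rewrite the question so all implicit assumptions are explicit:"
    )
    # Answer the now fully specified question.
    return llm(f"Visual details: {details}\nQuestion: {rephrased}\nAnswer:")
```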
Foundation models can perform zero-shot VQA when organized as specialized agents within a multi-agent system.
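One way to picture such a system, again with hypothetical callables standing in for the underlying foundation models, is an orchestrator that collects short reports from specialized agents (for example captioning, OCR, or counting agents) and reasons over them:

```python
from typing import Callable, Dict

def multi_agent_vqa(
    image_path: str,
    question: str,
    agents: Dict[str, Callable[[str, str], str]],  # name -> (image_path, question) -> report
    orchestrator: Callable[[str], str],            # (prompt) -> final answer
) -> str:
    """Zero-shot VQA by aggregating reports from specialized agents."""
    # Each agent inspects the image with respect to the question and reports back.
    reports = [f"{name}: {agent(image_path, question)}" for name, agent in agents.items()]
    # The orchestrator reasons over all reports to produce the final answer.
    prompt = (
        "Agent reports:\n" + "\n".join(reports) + "\n"
        f"Question: {question}\n"
        "Final answer:"
    )
    return orchestrator(prompt)
```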