The study explores the impact of inserting image captioning as an intermediate step in a zero-shot visual question answering (VQA) pipeline. It evaluates several image captioning models, including CogVLM, FuseCap, and BLIP-2, for VQA on the GQA dataset.
The key findings are:
Question-driven image captions, generated by conditioning the captioner on keywords extracted from the question, improve VQA performance across most question categories compared to general-purpose image captions.
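As a rough sketch of this idea (the stopword list, keyword extraction, and prompt template below are illustrative assumptions, not the study's exact implementation):

```python
# Question-driven captioning sketch: extract content words from the
# question and use them to steer the captioner's prompt toward
# question-relevant image content.

STOPWORDS = {"is", "are", "the", "a", "an", "of", "on", "in", "what",
             "which", "who", "where", "how", "does", "do", "to", "and", "or"}

def question_keywords(question: str) -> list[str]:
    """Keep the content-bearing words of the question."""
    tokens = question.lower().rstrip("?").split()
    return [t for t in tokens if t not in STOPWORDS]

def caption_prompt(question: str) -> str:
    """Build a captioning prompt conditioned on the question's keywords."""
    keywords = ", ".join(question_keywords(question))
    return f"Describe the image, focusing on: {keywords}."

print(caption_prompt("What color is the bag on the table?"))
# -> Describe the image, focusing on: color, bag, table.
```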
The question-driven captioning approach built on the CogVLM-chat variant outperforms the other image captioning methods under both exact matching and a range of cosine-similarity thresholds.
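The soft-matching evaluation can be sketched as follows; the encoder choice and the 0.7 threshold are assumptions, since only the use of several cosine-similarity thresholds alongside exact matching is stated:

```python
# Answer scoring sketch: accept an exact string match, otherwise fall
# back to embedding cosine similarity against the gold answer.
# 'all-MiniLM-L6-v2' and threshold=0.7 are illustrative choices.
from sentence_transformers import SentenceTransformer, util

encoder = SentenceTransformer("all-MiniLM-L6-v2")

def is_correct(predicted: str, gold: str, threshold: float = 0.7) -> bool:
    if predicted.strip().lower() == gold.strip().lower():
        return True  # exact match
    emb = encoder.encode([predicted, gold], convert_to_tensor=True)
    return util.cos_sim(emb[0], emb[1]).item() >= threshold
```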
The question-driven captions provide significant gains in the "verify" category (yes/no questions) and in the "attribute" and "category" question types, which focus on identifying and describing object properties.
Limiting the image captions to the single most relevant sentence reduces overall performance, suggesting that comprehensive, context-rich captions are needed for the best VQA results.
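The truncation ablation can be approximated like this; scoring caption sentences by TF-IDF similarity to the question is an assumed stand-in for whatever relevance criterion the study actually used:

```python
# Caption-truncation sketch: keep only the caption sentence most
# similar to the question. TF-IDF relevance scoring is an assumption.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def most_relevant_sentence(caption: str, question: str) -> str:
    sentences = [s.strip() for s in caption.split(".") if s.strip()]
    vectorizer = TfidfVectorizer().fit(sentences + [question])
    scores = cosine_similarity(
        vectorizer.transform([question]),
        vectorizer.transform(sentences),
    )[0]
    return sentences[scores.argmax()]
```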
The VQA performance achieved by combining question-driven image captions with GPT-3.5 exceeds the zero-shot performance of the BLIP-2 FlanT5XL model in most question categories, but falls short of the CogVLM-chat model's VQA performance.
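The caption-to-LLM stage presumably looks something like the sketch below; the prompt wording is an assumption, and "gpt-3.5-turbo" stands in for whichever GPT-3.5 variant the study used:

```python
# Second pipeline stage sketch: the LLM answers from the question-driven
# caption alone, never seeing the image. Prompt wording and model name
# are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def answer_from_caption(caption: str, question: str) -> str:
    prompt = (
        f"Image description: {caption}\n"
        f"Question: {question}\n"
        "Answer with a single word or short phrase."
    )
    response = client.chat.completions.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content.strip()
```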
The study highlights the potential of employing question-driven image captions and leveraging the capabilities of large language models to achieve competitive performance on the GQA dataset in a zero-shot setting.