Comprehensive Evaluation of Text-Generative Vision-Language Models through Adaptive Open-Ended VQA Benchmarking
The authors propose a novel open-ended VQA benchmark that leverages existing visual classification datasets and their semantic hierarchies to enable granular evaluation of text-generative vision-language models. The benchmark poses follow-up questions to resolve ambiguous answers, and the authors conduct a human evaluation study to select evaluation metrics that align with human judgment.