Evaluating and Enhancing Fine-Grained Visual-Linguistic Understanding in Vision-Language Models
Current state-of-the-art vision-language models exhibit significant limitations in understanding fine-grained visual-linguistic concepts such as object attributes, relationships, and quantities. To evaluate these capabilities comprehensively, a progressive data construction pipeline and a carefully designed benchmark (SPEC) are introduced; evaluation on SPEC reveals that existing models perform poorly on such fine-grained distinctions. A simple yet effective optimization approach is then proposed that improves fine-grained understanding without compromising the models' zero-shot capabilities.
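A common protocol for probing fine-grained understanding, which benchmarks of this kind typically follow, is to ask a model to pick the correct caption for an image from a set of hard negatives that differ only in one attribute, relation, or count. The sketch below illustrates this image-to-text matching step with cosine similarity over embeddings; the embeddings and captions here are toy placeholders, not SPEC data, and SPEC's exact setup may differ.

```python
import numpy as np

def pick_caption(image_emb: np.ndarray, caption_embs: np.ndarray) -> int:
    """Return the index of the caption most similar to the image.

    image_emb:    (d,) embedding of the image
    caption_embs: (n, d) embeddings of candidate captions
    """
    # L2-normalize so the dot product equals cosine similarity
    img = image_emb / np.linalg.norm(image_emb)
    caps = caption_embs / np.linalg.norm(caption_embs, axis=1, keepdims=True)
    return int(np.argmax(caps @ img))

# Toy example with hypothetical embeddings: index 0 is the correct
# caption; indices 1 and 2 are count/attribute hard negatives.
img = np.array([1.0, 0.0, 0.2])
caps = np.array([
    [0.9, 0.1, 0.3],    # e.g. "two red cups"   (correct)
    [0.1, 0.9, 0.0],    # e.g. "three red cups" (count negative)
    [0.0, 0.2, -0.9],   # e.g. "two blue cups"  (attribute negative)
])
print(pick_caption(img, caps))  # → 0
```

A model with weak fine-grained understanding assigns near-identical similarity to all three candidates, so accuracy under this protocol falls toward chance; that gap is what a benchmark like SPEC is designed to expose.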