Large Vision-Language Models struggle with fine-grained visual categorization due to a modality gap, hindering accurate attribute generation.
The work examines the limitations of instruction-tuned Large Vision-Language Models (LVLMs) in fine-grained visual categorization, highlighting a modality gap between textual and visual inputs. The proposed FINER benchmark aims to evaluate and enhance LVLMs' fine-grained image understanding.