Core Concepts
The authors propose a novel evaluation benchmark, CogBench, to assess the high-level cognitive abilities of Large Vision-Language Models (LVLMs) using images with rich semantics. The evaluation reveals a significant gap in cognitive ability between LVLMs and humans.
Abstract
CogBench is an evaluation benchmark focused on the high-level cognitive reasoning abilities of LVLMs. The study highlights the gap in cognitive ability between LVLMs and humans, emphasizing the need for further development in this area. The dataset construction, image collection criteria, annotation process, task design, and evaluation strategies are detailed to give a comprehensive view of how LVLMs' cognitive capabilities are assessed.
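As a rough sketch of how a benchmark item like this might be represented, here is a minimal Python example; the class and field names (CogBenchSample, entities, reasoning_points, vqa_questions) are illustrative assumptions, not the paper's actual schema:

    from dataclasses import dataclass, field

    @dataclass
    class CogBenchSample:
        """One benchmark item: an image plus its human annotations.

        All field names are illustrative assumptions, not the
        paper's actual data format.
        """
        image_path: str
        # Objects/entities a model should recognize in the image.
        entities: list[str] = field(default_factory=list)
        # Higher-level inferences a good description should cover.
        reasoning_points: list[str] = field(default_factory=list)
        # Question/answer pairs for the VQA task.
        vqa_questions: list[tuple[str, str]] = field(default_factory=list)

    sample = CogBenchSample(
        image_path="images/0001.jpg",
        entities=["umbrella", "wet street", "taxi"],
        reasoning_points=["it has recently rained"],
        vqa_questions=[("Why is the street wet?", "It has recently rained.")],
    )

Separating low-level entities from higher-level reasoning points mirrors the benchmark's distinction between recognition and cognition.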
The study reports experiments with selected LVLMs on both the Description and Visual Question Answering tasks from CogBench. Results show varying performance across models, with GPT-4V consistently outperforming the open-source models. Recognition scores and cognition scores are analyzed to highlight each model's strengths and weaknesses in high-level image understanding.
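The summary does not specify how these scores are computed. Below is a minimal, assumption-laden sketch that scores a model's description by naive keyword coverage; the substring matching here is a stand-in for whatever judging procedure the paper actually uses, and all names and data are hypothetical:

    def coverage_score(points: list[str], description: str) -> float:
        """Fraction of annotated points whose text appears in the description.

        Naive substring matching is a stand-in for the benchmark's real
        (likely human- or model-based) judging procedure.
        """
        if not points:
            return 0.0
        text = description.lower()
        return sum(p.lower() in text for p in points) / len(points)

    # Toy data: recognition covers entities, cognition covers inferences.
    entities = ["umbrella", "wet street", "taxi"]
    reasoning_points = ["it has recently rained", "people are commuting"]
    description = "A taxi waits on a wet street while people hold umbrellas."

    print(coverage_score(entities, description))          # 1.0 (recognition-style)
    print(coverage_score(reasoning_points, description))  # 0.0 (cognition-style)

The toy output reflects the pattern the paper reports: a model can recognize everything in an image while capturing none of the inferences it supports.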
The authors acknowledge limitations in dataset size and discuss ethical considerations. Future updates to CogBench aim to add more high-quality images while maintaining the strict collection criteria. On ethics, annotators were treated fairly and data usage guidelines were followed.
Stats
Recognition Score: 0.73; Cognition Score: 0
Recognition Score: 0.27; Cognition Score: 0.07
Quotes
"There is still a large gap between the cognitive ability of LVLMs and humans."
"CogBench defines eight core cognitive reasoning capabilities."
"GPT-4V achieves the best performance in terms of recognition."