Core Concepts
Existing multimodal large language models (LLMs) exhibit significant limitations in core visual perception abilities, performing far below human levels on a range of classic computer vision tasks.
Abstract
The paper introduces Blink, a new benchmark designed to evaluate the visual perception capabilities of multimodal large language models (LLMs). Blink consists of 14 classic computer vision tasks, ranging from low-level pattern matching to high-level visual understanding. Each task is reformatted as a multiple-choice question-answering problem, pairing one or more images with visual prompts (such as circles marked on the image) and a set of answer choices.
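To make the format concrete, here is a minimal sketch of how a single Blink-style sample might be represented; the field names and values are illustrative assumptions, not the paper's actual data schema.

```python
# Hypothetical representation of one Blink-style multiple-choice sample.
# All field names and values are illustrative, not the paper's schema.
from dataclasses import dataclass, field

@dataclass
class BlinkSample:
    task: str                    # one of the 14 tasks
    images: list[str]            # paths to the input image(s)
    question: str                # natural-language question
    visual_prompts: list[tuple[int, int]] = field(default_factory=list)
    # pixel coordinates of markers (e.g., circles) drawn on the image
    choices: list[str] = field(default_factory=list)
    answer: str = ""             # ground-truth choice label

sample = BlinkSample(
    task="relative_depth",
    images=["example.jpg"],
    question="Which marked point is closer to the camera?",
    visual_prompts=[(220, 140), (480, 310)],
    choices=["(A) point A", "(B) point B"],
    answer="(A)",
)
```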
The key findings are:
While humans can solve the Blink tasks with 95.7% accuracy on average, even the best-performing multimodal LLMs, such as GPT-4V and Gemini Pro, achieve only around 50% accuracy, merely 13.17 and 7.63 percentage points better than random guessing, respectively. This highlights a significant gap between human and machine visual perception abilities.
Multimodal LLMs perform comparatively well on spatial reasoning, art style, and counting tasks, but struggle with pixel- and crop-level tasks such as relative depth estimation, reflectance comparison, and visual correspondence.
Specialist computer vision models significantly outperform multimodal LLMs on Blink tasks, suggesting potential pathways for future improvements in multimodal perception.
Experiments show that reducing Blink to text-only questions, by replacing each image with a dense caption and querying a text-only LLM, can yield performance comparable to or better than that of multimodal LLMs, exposing how poorly existing multimodal benchmarks isolate genuine visual perception. A sketch of this caption-based reduction follows.
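As a minimal illustration of that reduction, the pipeline below swaps images for captions before asking a text-only model; caption() and ask_llm() are hypothetical placeholders for a dense image captioner and a text-only LLM, not real APIs.

```python
# Minimal sketch of the caption-based, text-only reduction of Blink.
# caption() and ask_llm() are hypothetical placeholders, not real APIs.

def caption(image_path: str) -> str:
    """Return a dense natural-language description of the image."""
    raise NotImplementedError  # e.g., call a dense captioning model here

def ask_llm(prompt: str) -> str:
    """Return the model's chosen answer label, e.g., '(A)'."""
    raise NotImplementedError  # e.g., call a text-only LLM here

def answer_text_only(sample: "BlinkSample") -> str:
    # Replace every image with its dense caption, then pose the original
    # multiple-choice question as a pure text problem.
    captions = "\n".join(caption(path) for path in sample.images)
    prompt = (
        f"Image description:\n{captions}\n\n"
        f"Question: {sample.question}\n"
        f"Choices: {' '.join(sample.choices)}\n"
        "Answer with the letter of the correct choice."
    )
    return ask_llm(prompt)
```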
The authors believe Blink can serve as an effective testbed for bridging the gap between traditional notions of perception and the modern generative capabilities of multimodal LLMs.
Stats
Humans achieve 95.7% average accuracy on Blink.
GPT-4V achieves 51.26% accuracy, only 13.17 percentage points better than random guessing.
Gemini Pro achieves 45.72% accuracy, only 7.63 percentage points better than random guessing.
Specialist computer vision models outperform multimodal LLMs by 18% to 57% on specific tasks.
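As a consistency check, the two model stats above imply the same random-guess baseline; the baseline sits well above 25% because the number of answer choices varies by task and is often fewer than four:

51.26% - 13.17% = 38.09% (GPT-4V)
45.72% - 7.63% = 38.09% (Gemini Pro)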
Quotes
"While humans can solve the Blink tasks with 95.7% accuracy on average, even the best-performing multimodal LLMs like GPT-4V and Gemini Pro achieve only around 50% accuracy, merely 13.17% and 7.63% better than random guessing, respectively."
"Specialist computer vision models significantly outperform multimodal LLMs on Blink tasks, suggesting potential pathways for future improvements in multimodal perception."