Core Concepts
Even the most advanced multimodal large language models struggle to achieve high performance on a new benchmark, MME-RealWorld, which features high-resolution images and challenging real-world scenarios.
Abstract
The authors introduce a new benchmark called MME-RealWorld to comprehensively evaluate the capabilities of Multimodal Large Language Models (MLLMs). The benchmark addresses several limitations of existing benchmarks:
Data Scale: MME-RealWorld contains 29,429 manually annotated question-answer pairs, making it the largest fully human-annotated benchmark of its kind to date.
Data Quality: The benchmark features high-resolution images averaging 2,000x1,500 pixels, significantly higher than the resolutions found in existing benchmarks. All annotations are manually created and cross-checked by professionals.
Task Difficulty: The tasks in MME-RealWorld are highly challenging, covering 5 real-world domains (Optical Character Recognition, Remote Sensing, Diagram and Table, Monitoring, and Autonomous Driving) and 43 subtasks. Even the most advanced MLLMs fail to reach 60% accuracy on the benchmark.
The authors evaluate 29 prominent MLLMs, including GPT-4o, Gemini 1.5 Pro, and Claude 3.5 Sonnet, on the benchmark. The results show that current models still have significant room for improvement in understanding high-resolution images and complex real-world scenarios. The dataset and evaluation code are publicly released to encourage further research in this direction.
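For context on how headline numbers like the sub-60% accuracy figure are typically computed, the sketch below scores multiple-choice predictions and reports overall and per-domain accuracy. It is a minimal illustration rather than the authors' released evaluation code: the record fields ("domain", "answer", "prediction") and the exact-match scoring of option letters are assumptions made for the example.

from collections import defaultdict

def score_predictions(records):
    """Compute overall and per-domain accuracy for multiple-choice predictions.

    Each record is assumed to be a dict with:
      - "domain":     task domain, e.g. "OCR" or "Remote Sensing"
      - "answer":     ground-truth option letter, e.g. "A"
      - "prediction": the model's chosen option letter
    (Field names are illustrative, not the benchmark's actual schema.)
    """
    correct, total = 0, 0
    per_domain = defaultdict(lambda: [0, 0])  # domain -> [correct, total]

    for rec in records:
        hit = rec["prediction"].strip().upper() == rec["answer"].strip().upper()
        correct += hit
        total += 1
        per_domain[rec["domain"]][0] += hit
        per_domain[rec["domain"]][1] += 1

    overall = correct / total if total else 0.0
    by_domain = {d: c / n for d, (c, n) in per_domain.items()}
    return overall, by_domain

if __name__ == "__main__":
    demo = [
        {"domain": "OCR", "answer": "B", "prediction": "B"},
        {"domain": "Remote Sensing", "answer": "C", "prediction": "A"},
    ]
    overall, by_domain = score_predictions(demo)
    print(f"overall accuracy: {overall:.2%}")
    print(by_domain)

Reporting accuracy per domain as well as overall mirrors how the paper breaks results down by task category, which is what exposes where models fall short (e.g., monitoring or remote sensing versus OCR).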
Stats
The benchmark contains 29,429 manually annotated question-answer pairs.
The average image resolution is 2,000x1,500 pixels, significantly higher than existing benchmarks.
The tasks cover 5 real-world domains and 43 subtasks, and are challenging enough that no evaluated model reaches 60% accuracy.
Quotes
"Even the most advanced models struggle with our benchmarks, where none of them reach 60% accuracy."
"The challenges of perceiving high-resolution images and understanding complex real-world scenarios remain urgent issues to be addressed."