Core Concepts
Large vision-language models (LVLMs) have recently achieved rapid progress, but current evaluation methods have two primary issues: 1) Many evaluation samples do not require visual understanding, as the answers can be directly inferred from the questions and options or from the world knowledge embedded in language models. 2) Unintentional data leakage exists in the training of LLMs and LVLMs, allowing them to answer some visual-necessary questions without accessing the images.
Abstract
The paper identifies two key issues in the current evaluation of large vision-language models (LVLMs):
- Visual content is unnecessary for many evaluation samples:
  - Some samples have answers that can be directly inferred from the questions and options, without requiring visual understanding.
  - Other samples can be answered using the world knowledge embedded in large language models (LLMs), without needing the visual input.
  - Quantitative analysis shows that a significant portion of samples across popular benchmarks exhibit this issue, with some benchmarks having over 50% of samples that can be solved by LLMs without visual input.
- Unintentional data leakage exists in the training of LLMs and LVLMs:
  - LLMs and LVLMs can sometimes answer visual-necessary questions without accessing the images, suggesting they have memorized these samples during the large-scale training process.
  - Detailed experiments show that this data leakage problem is particularly serious for LVLMs, with some models outperforming their LLM backbones on certain benchmarks without using visual input (see the text-only probing sketch after this list).
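Both findings rest on the same simple probe: present each multiple-choice sample to a model with the image withheld and check whether it still selects the correct option. Below is a minimal sketch of such a text-only probe; the prompt template and the `query_model` callable are illustrative assumptions, not the paper's exact evaluation setup.

```python
# Sketch: measure how many multiple-choice samples a model answers correctly
# when the image is withheld. `query_model` is a hypothetical callable that
# sends a text prompt to an LLM (or an LVLM in text-only mode) and returns
# its reply; the prompt format below is an assumption, not the paper's.

def build_text_only_prompt(question: str, options: dict) -> str:
    formatted_options = "\n".join(f"{label}. {text}" for label, text in options.items())
    return (
        "Answer the following multiple-choice question. "
        "Reply with the option letter only.\n\n"
        f"Question: {question}\nOptions:\n{formatted_options}\nAnswer:"
    )

def text_only_accuracy(samples: list, query_model) -> float:
    """Fraction of samples answered correctly without any visual input."""
    correct = 0
    for sample in samples:
        prompt = build_text_only_prompt(sample["question"], sample["options"])
        prediction = query_model(prompt).strip().upper()[:1]  # e.g. "B"
        correct += prediction == sample["answer"]
    return correct / len(samples)
```

Comparing this image-free accuracy against a random-choice baseline (25% for four options) is what flags a sample, or a whole benchmark, as answerable without its visual content.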
To address these issues, the authors introduce MMStar, a new elite vision-critical multi-modal benchmark with 1,500 carefully curated samples. MMStar covers 6 core capabilities and 18 detailed axes, aiming to evaluate the actual multi-modal capabilities of LVLMs. Additionally, the authors propose two new metrics, multi-modal gain (MG) and multi-modal leakage (ML), to measure the actual performance gain and the degree of data leakage in multi-modal training.
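As a rough illustration, both metrics can be read as accuracy differences: MG as the gain an LVLM gets from actually seeing the images, ML as the margin by which its image-free score exceeds that of its text-only LLM backbone. The sketch below is a paraphrase under that assumption; the function names and the clamping at zero are mine, not necessarily the paper's exact formulation.

```python
# Sketch of the two proposed metrics, assuming they reduce to simple accuracy
# differences (a paraphrase; the paper's exact formulation may aggregate or
# normalize differently).

def multi_modal_gain(acc_with_images: float, acc_without_images: float) -> float:
    """MG: performance an LVLM actually derives from the visual input."""
    return acc_with_images - acc_without_images

def multi_modal_leakage(acc_without_images: float, llm_backbone_acc: float) -> float:
    """ML: how far the LVLM's image-free score exceeds its text-only LLM
    backbone, taken as evidence of evaluation data leaking into training."""
    return max(0.0, acc_without_images - llm_backbone_acc)

# Illustrative numbers only (not results from the paper):
# multi_modal_gain(60.0, 45.0)     -> 15.0
# multi_modal_leakage(45.0, 38.0)  -> 7.0
```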
Experiments on MMStar and other benchmarks show that the high-resolution version of GPT-4V outperforms 16 leading LLMs and LVLMs, ranking first with 57.1% accuracy. GPT-4V also achieves the best MG and a small ML, indicating an effective multi-modal training strategy and little data leakage.
Stats
GeminiPro achieves 42.9% on the MMMU benchmark without any visual input, outperforming the random choice baseline across six benchmarks by over 20% on average.
Sphinx-X-MoE gets 43.6% on MMMU without accessing images, surpassing its LLM backbone by 17.9%.
Quotes
"Visual content is unnecessary for many samples. The answers can be directly inferred from the questions and options, or the world knowledge embedded in LLMs."
"Unintentional data leakage exists in LLM and LVLM training. LLM and LVLM could still answer some visual-necessary questions without visual content, indicating the memorizing of these samples within large-scale training data."