The authors introduce MMT-Bench, a new benchmark for evaluating the multimodal multitask understanding of Large Vision-Language Models (LVLMs). MMT-Bench comprises 31,325 meticulously curated multiple-choice visual questions spanning 32 core meta-tasks and 162 subtasks, making it significantly broader and more challenging than previous multimodal benchmarks.
The key highlights of MMT-Bench are:
Extensive Task Coverage: MMT-Bench covers 32 core meta-tasks and 162 subtasks, testing 14 kinds of multimodal capabilities including visual recognition, localization, reasoning, OCR, counting, 3D perception, and temporal understanding. This breadth of tasks is crucial for evaluating progress towards multitask AGI.
Diverse Input Modalities: The benchmark includes 13 different input image types, such as natural scenes, synthetic images, text-rich images, medical images, and point clouds, requiring LVLMs to demonstrate versatile visual understanding abilities.
Comprehensive Evaluation: The authors assess 30 publicly available LVLMs, spanning open-source and closed-source models, on MMT-Bench. The results underscore the benchmark's difficulty: even advanced models such as InternVL-Chat, GPT-4V, and GeminiProVision achieve only 63.4%, 62.0%, and 61.6% accuracy, respectively (the multiple-choice scoring behind these numbers is sketched below).
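To make the accuracy figures concrete, here is a minimal sketch of how multiple-choice benchmark scoring typically works. The `model.answer` interface and the record field names (`image`, `question`, `options`, `answer`) are hypothetical placeholders, not MMT-Bench's actual loader or schema.

```python
import json

def evaluate_multiple_choice(model, questions_path):
    """Score a model on multiple-choice visual questions.

    `model.answer(image, question, options)` is a hypothetical
    interface assumed to return a predicted option letter such
    as "A"; the benchmark's real API may differ.
    """
    with open(questions_path) as f:
        questions = json.load(f)

    correct = 0
    for q in questions:
        # Each record is assumed to hold an image path, the question
        # text, the candidate options, and the ground-truth letter.
        pred = model.answer(q["image"], q["question"], q["options"])
        if pred == q["answer"]:
            correct += 1

    # Overall accuracy; per-meta-task accuracy would group records
    # by task before averaging.
    return correct / len(questions)
```

Reported scores like GPT-4V's 62.0% correspond to this kind of accuracy, aggregated across the benchmark's subtasks.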
The authors also provide in-depth analyses, including the influence of LLM and model scaling, the performance of LVLMs across different meta-tasks, and the effects of multi-image versus single-image prompts (illustrated in the sketch below). These insights can guide future LVLM development towards achieving multitask AGI.
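The multi-image versus single-image comparison concerns how samples with several images (e.g., video frames for temporal understanding) are fed to a model. The sketch below illustrates the general idea only: the tiling layout is a simple horizontal concatenation chosen for illustration, not the paper's actual prompt construction.

```python
from PIL import Image

def single_image_prompt(frames):
    """Tile several frames into one composite image, for models
    that accept only a single image per prompt.

    A minimal horizontal concatenation; the authors' exact layout
    (grid shape, ordering, separators) is not specified here.
    """
    width = sum(f.width for f in frames)
    height = max(f.height for f in frames)
    canvas = Image.new("RGB", (width, height))
    x = 0
    for f in frames:
        canvas.paste(f, (x, 0))
        x += f.width
    return [canvas]  # one image stands in for the whole sequence

def multi_image_prompt(frames):
    """Pass each frame separately, for models that accept an
    interleaved list of images in a single prompt."""
    return list(frames)
```

Comparing a model's accuracy under these two input regimes isolates how much it benefits from seeing images separately rather than merged.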
Key insights distilled from the paper by Kaining Ying et al., arXiv, April 25, 2024: https://arxiv.org/pdf/2404.16006.pdf