Key Concepts
MLLMs struggle with visual math diagrams, relying heavily on textual cues.
Abstract
Introduction:
Multi-modal Large Language Models (MLLMs) excel in visual contexts but struggle with visual math problem-solving.
MATHVERSE Creation:
MATHVERSE introduces a visual math benchmark to evaluate MLLMs comprehensively.
Dataset includes 2,612 high-quality math problems transformed into six versions for evaluation.
Evaluation Strategy:
Chain-of-Thought (CoT) evaluation assesses the reasoning process of MLLMs step-by-step.
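A minimal sketch of what step-wise CoT evaluation could look like, assuming a solution is split into reasoning steps and each step is graded individually. The `judge_step` function is a hypothetical stand-in: a real pipeline would query an LLM judge rather than the simple substring check used here for illustration.

```python
def split_steps(solution: str) -> list[str]:
    """Split a model's solution into individual reasoning steps (one per line)."""
    return [s.strip() for s in solution.split("\n") if s.strip()]

def judge_step(step: str, reference: str) -> bool:
    # Placeholder judge: a real CoT evaluation would ask an LLM to
    # verify the step; here we just check it appears in the reference.
    return step in reference

def cot_score(solution: str, reference: str) -> float:
    """Fraction of reasoning steps judged correct, in [0.0, 1.0]."""
    steps = split_steps(solution)
    if not steps:
        return 0.0
    correct = sum(judge_step(s, reference) for s in steps)
    return correct / len(steps)
```

Scoring each step, rather than only the final answer, gives partial credit to solutions whose reasoning is mostly sound but whose last step slips.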
Experimental Results:
Most MLLMs perform better with text-only input, indicating reliance on textual information over visual diagrams.
Key Findings:
GPT-4V and ShareGPT4V show better comprehension of visual content for mathematical reasoning.
Statistics
MATHVERSE collects 2,612 high-quality math problems and transforms each into six distinct versions.
Some MLLMs rely on textual information rather than visual information, and show unexpected performance gains as a result.