Core Concept
The authors introduce Multimodal ArXiv, consisting of ArXivCap (a figure-caption dataset) and ArXivQA (a question-answering dataset), to improve LVLMs' comprehension of scientific figures. Fine-tuning on these datasets significantly enhances LVLMs' mathematical reasoning capabilities.
Summary
Multimodal ArXiv introduces ArXivCap and ArXivQA to enhance LVLMs' understanding of scientific figures. Training on these datasets improves mathematical reasoning abilities and caption generation for academic figures. The study highlights challenges in understanding scientific figures and the effectiveness of domain-specific training.
The paper describes how Multimodal ArXiv was built and evaluated, covering dataset curation, experimental settings, results, analysis, and limitations. It emphasizes that LVLMs need domain-specific training to comprehend scientific literature effectively.
Key points include the introduction of ArXivCap and ArXivQA datasets, experiments validating their effectiveness in enhancing LVLMs' capabilities, evaluation results across various tasks, manual evaluation findings on caption quality, case studies illustrating tuning effects with ArXivQA, and limitations of the study.
Statistics
Large vision-language models (LVLMs) excel across diverse tasks involving concrete images.
ArXivCap consists of 6.4M images and 3.9M captions sourced from 572K papers.
Evaluation results show a significant accuracy gain on a multimodal mathematical reasoning benchmark.
The dataset statistics reveal a wide coverage of scientific domains.
Training on the dataset yields substantial performance improvements across all four tasks.
Quotes
"The inadequacy of training datasets in scientific domains is the main underlying cause."
"Our error analysis uncovers misinterpretations of visual context, recognition errors, and overly simplified captions by current LVLMs."
"Fine-tuning on our dataset yields a significant performance boost for this task."