The authors introduce Multimodal ArXiv, consisting of ArXivCap and ArXivQA, to improve LVLMs' comprehension of scientific figures through diverse figure-caption and question-answering data. Fine-tuning on these datasets significantly enhances LVLMs' mathematical reasoning capabilities.