The paper introduces a novel multimodal dataset, MM-PhyQA, containing challenging high school-level physics questions. It evaluates the performance of contemporary large language models (LLMs) and large multimodal models (LMMs) on this dataset, both with and without the incorporation of multimodal elements.
The key highlights and insights are:
Text-only LLMs such as Mistral-7b and LLaMA2-7b struggle with complex multimodal physics questions, achieving low accuracies of 25.95% and 42.83%, respectively.
Multimodal models such as LLaVA-1.5 perform significantly better: the 13b variant, fine-tuned with a LoRA rank of 128 and prompted with the novel Multi-Image Chain-of-Thought (MI-CoT) technique, achieves the highest test-set accuracy of 71.65% (a LoRA configuration sketch follows this list).
Fine-tuning general-purpose LLMs and LMMs on the dataset leads to substantial performance improvements compared to using them in a zero-shot setting.
The MI-CoT Prompting technique, which incorporates multiple images during the Chain-of-Thought prompting process, further boosts the models' reasoning capabilities, as evidenced by higher ROUGE scores (a prompt-assembly sketch also follows the list).
Error analysis reveals that the best-performing model still struggles with conceptual, grounding, and computational errors, highlighting the need for continued research and development in this area.
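To make the fine-tuning setup mentioned above concrete, here is a minimal sketch of a rank-128 LoRA configuration using the Hugging Face peft library. Only the rank comes from the paper; the scaling factor, target modules, and dropout are illustrative assumptions, not the paper's reported settings.

```python
# A hedged sketch of LoRA fine-tuning at rank 128 with the peft library.
# All hyperparameters except the rank are assumptions for illustration.

from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=128,                # LoRA rank used by the best-performing variant
    lora_alpha=256,       # assumed scaling factor (a common choice is 2 * r)
    target_modules=["q_proj", "v_proj"],  # assumed attention projections
    lora_dropout=0.05,    # assumed dropout
    task_type="CAUSAL_LM",
)

# Wrapping a backbone (e.g., a LLaVA-1.5 13b language model) would look like:
# model = get_peft_model(base_model, lora_config)
```

Training then updates only the low-rank adapter weights, which is what makes fine-tuning a 13b model on a modest dataset tractable.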
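The MI-CoT idea can also be sketched in code. The following is a minimal illustration of how a multi-image Chain-of-Thought prompt might be assembled for a LLaVA-style model: each few-shot exemplar carries its own image and worked solution, and the target question's image comes last. The `<image>` placeholder convention, the `Exemplar` structure, and the helper name `build_mi_cot_prompt` are assumptions for illustration, not the paper's actual implementation.

```python
# A minimal sketch of Multi-Image Chain-of-Thought (MI-CoT) prompt assembly.
# Names and the <image> placeholder convention are illustrative assumptions.

from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class Exemplar:
    """One worked example: a physics diagram plus its CoT solution."""
    image_path: str
    question: str
    reasoning: str  # step-by-step chain-of-thought solution
    answer: str

def build_mi_cot_prompt(exemplars: List[Exemplar],
                        target_question: str,
                        target_image_path: str) -> Tuple[str, List[str]]:
    """Interleave several (image, question, worked solution) exemplars
    before the target question, so the model sees multiple images in a
    single chain-of-thought context."""
    parts: List[str] = []
    images: List[str] = []
    for ex in exemplars:
        images.append(ex.image_path)
        parts.append(
            f"<image>\nQuestion: {ex.question}\n"
            f"Let's think step by step.\n{ex.reasoning}\n"
            f"Answer: {ex.answer}\n"
        )
    # The target question comes last; its answer is left for the model.
    images.append(target_image_path)
    parts.append(
        f"<image>\nQuestion: {target_question}\nLet's think step by step.\n"
    )
    return "\n".join(parts), images
```

The returned text and ordered image list would then be passed to the model's multimodal input pipeline, with one image consumed per `<image>` placeholder.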
Key insights from the paper by Avinash Anan... at arxiv.org, 04-16-2024: https://arxiv.org/pdf/2404.08704.pdf