Main Idea
Leveraging reinforcement learning from human feedback (RLHF) and image captioning techniques to improve the performance of large language models (LLMs) in answering multimodal physics questions.
Abstract
The paper proposes a framework called MM-PhyRLHF to enhance the performance of LLMs in answering multimodal physics questions. The key aspects of the framework are:
Dataset Augmentation:
- The authors utilize the MM-PhyQA dataset, which contains multimodal physics questions and answers.
- They create a specialized preference dataset by generating multiple responses for each question using different LLMs and ranking them using Gemini Pro.
- This preference dataset is used to train a reward model (RM) as part of the RLHF process.
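The paper does not spell out the exact pair-construction step, but a ranked list of candidate answers (e.g., the Gemini Pro ranking described above) is conventionally converted into pairwise (chosen, rejected) examples for reward-model training. A minimal sketch, with a hypothetical physics question and rankings for illustration:

```python
from itertools import combinations

def build_preference_pairs(question, ranked_responses):
    """Convert a ranked list of responses (best first) into (chosen, rejected)
    pairs suitable for training a reward model."""
    pairs = []
    # Every higher-ranked response is preferred over every lower-ranked one.
    for better, worse in combinations(ranked_responses, 2):
        pairs.append({"prompt": question, "chosen": better, "rejected": worse})
    return pairs

# Hypothetical example: three candidate answers, best first.
pairs = build_preference_pairs(
    "A ball is dropped from 20 m. Find its speed on impact.",
    ["v = sqrt(2gh) ≈ 19.8 m/s", "v = gh = 196 m/s", "The ball falls down."],
)
```

A list of n ranked responses yields n·(n-1)/2 training pairs, so even a modest number of candidates per question expands the effective preference dataset considerably.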
Image Captioning:
- The authors add detailed captions to the images in the MM-PhyQA dataset to provide the LLM with more context and reduce hallucinations.
- They use the Infi-MM captioning model to generate the captions.
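The call to the Infi-MM captioning model itself is not shown in this summary; assuming the caption is available as text, the integration step reduces to merging it into the prompt so the LLM receives explicit visual context. A sketch (the prompt template is an assumption, not the paper's exact format):

```python
def augment_prompt(question, caption):
    """Prepend a generated image caption to the question text so the model
    sees a textual description of the figure alongside the question."""
    return f"Image description: {caption}\n\nQuestion: {question}"

prompt = augment_prompt(
    "What is the net force on the block?",
    "A free-body diagram of a block on an inclined plane with friction.",
)
```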
RLHF Integration:
- The authors integrate the RLHF methodology into the training process of the LLM to enhance its contextual reasoning capabilities and align its responses with human preferences.
- The trained RM is used to provide feedback to the LLM during the iterative RLHF process, where the LLM's policy is refined based on the predicted rewards.
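The summary does not give the reward model's objective, but reward models trained on preference pairs commonly use a Bradley-Terry style pairwise loss: the loss is small when the chosen response receives a higher scalar reward than the rejected one. A minimal sketch of that loss in plain Python:

```python
import math

def pairwise_rm_loss(reward_chosen, reward_rejected):
    """Bradley-Terry pairwise loss: -log(sigmoid(r_chosen - r_rejected)).
    Minimizing it pushes the reward model to score preferred responses higher."""
    margin = reward_chosen - reward_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

During the iterative RLHF phase, the trained reward model scores the policy's sampled answers, and those scalar rewards drive the policy update (e.g., via PPO); the loss above concerns only the reward model's own training.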
Experiments and Evaluation:
- The authors experiment with different configurations, including fine-tuning the LLaVA models with and without RLHF, and with or without image captions.
- They compare the performance of the LLMs across these settings to assess the impact of the proposed techniques.
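The two toggles described above (RLHF on/off, captions on/off) form a 2×2 experimental grid. A small sketch enumerating the four configurations (the flag names are illustrative, not the paper's):

```python
from itertools import product

# Four fine-tuning settings: base, +captions, +RLHF, +both.
configs = [
    {"rlhf": use_rlhf, "image_captions": use_captions}
    for use_rlhf, use_captions in product([False, True], repeat=2)
]
```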
The results demonstrate the effectiveness of the RLHF and image captioning approaches in improving the performance of LLMs on multimodal physics question-answering tasks, particularly in the context of Indian high school education.
Statistics
The MM-PhyQA dataset contains 4,500 multimodal physics questions and answers.
The preference dataset used for RLHF training contains 8,000 paired responses, generated by five different LLMs.
Quotes
"Combining the LLM's capability of data processing, text generation, contextual reasoning, and pattern recognition with its capabilities of handling multiple modalities has made it a popular option for educational question-answering."
"Improving the multimodal capabilities of models is crucial, especially in the field of question-answering, where the inclusion of images can exponentially increase the understandability of the question and result in more accuracy."