Key Concepts
LLaVA-Critic is an open-source large multimodal model designed to evaluate the performance of other AI models across various tasks, offering a cost-effective alternative to proprietary models like GPT-4V and advancing the development of self-critiquing AI.
Abstract
Bibliographic Information: Xiong, T., Wang, X., Guo, D., Ye, Q., Fan, H., Gu, Q., Huang, H., & Li, C. (2024). LLaVA-Critic: Learning to Evaluate Multimodal Models. arXiv preprint arXiv:2410.02712.
Research Objective: This paper introduces LLaVA-Critic, the first open-source large multimodal model (LMM) designed to evaluate the performance of other multimodal models across a wide range of tasks. The research aims to demonstrate LLaVA-Critic's effectiveness as a reliable and cost-effective alternative to proprietary evaluation models like GPT-4V.
Methodology: The researchers developed LLaVA-Critic by fine-tuning a pre-trained LLaVA-OneVision (LLaVA-OV) model on a newly curated dataset called LLaVA-Critic-113k. This dataset consists of roughly 113,000 evaluation instruction samples, in which instruction-response pairs are annotated with evaluation scores and justifications generated by GPT-4o (a schematic example of such a sample follows below). The dataset covers a diverse range of tasks, including visual chat, detailed captioning, reasoning, and hallucination detection.
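For illustration only, a pointwise critic training sample might be structured roughly as follows; the field names, file path, and prompt wording are hypothetical and do not reproduce the exact LLaVA-Critic-113k schema.

```python
# Hypothetical structure of a pointwise critic instruction sample.
# Field names, image path, and prompt wording are illustrative only,
# not the exact LLaVA-Critic-113k format.
sample = {
    "image": "coco/000000123456.jpg",  # image the evaluated response refers to
    "question": "Describe the scene in detail.",
    "response": "A cyclist rides past a row of parked cars on a rainy street.",
    "evaluation_instruction": (
        "Rate the response to the question about the given image on a scale "
        "of 1-10, considering accuracy, detail, and relevance. "
        "Provide a brief justification."
    ),
    # Target output the critic learns to produce (score plus justification),
    # originally annotated by GPT-4o.
    "critic_output": "Score: 7. The response is accurate but omits background details.",
}
```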
Key Findings: The study demonstrates that LLaVA-Critic achieves high correlation with GPT-4o in both pointwise scoring and pairwise ranking of model responses across various multimodal benchmarks. It also shows that LLaVA-Critic can be effectively used in preference learning, where it provides reward signals to improve the performance of other LMMs through techniques like Direct Preference Optimization (DPO).
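As a rough sketch of how critic-derived reward signals feed into preference learning, the snippet below implements the standard DPO objective on sequence log-probabilities. This is the generic DPO loss under the assumption that chosen/rejected response pairs have been ranked by the critic; it is not the paper's actual training code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps: torch.Tensor,
             policy_rejected_logps: torch.Tensor,
             ref_chosen_logps: torch.Tensor,
             ref_rejected_logps: torch.Tensor,
             beta: float = 0.1) -> torch.Tensor:
    """Standard DPO loss on per-sequence log-probabilities.

    The chosen/rejected split would come from critic judgments
    (e.g., LLaVA-Critic ranking two candidate responses); beta controls
    how far the policy may drift from the reference model.
    """
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()
```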
Main Conclusions: LLaVA-Critic demonstrates the potential of open-source LMMs for self-critique and evaluation of AI models. The authors argue that LLaVA-Critic offers a cost-effective and customizable alternative to proprietary evaluation models, paving the way for more accessible and transparent AI evaluation.
Significance: This research significantly contributes to the field of multimodal machine learning by introducing a robust and open-source evaluation model. This has implications for the development of more reliable, fair, and transparent AI systems.
Limitations and Future Research: While LLaVA-Critic shows promising results, the authors acknowledge the need for further research in developing even more robust and generalizable evaluation models. Future work could explore incorporating a wider range of evaluation criteria, expanding the training dataset, and investigating the model's performance on an even broader set of tasks.
Statistics
The LLaVA-Critic-113k dataset consists of 46k images with 113k evaluation instruction samples.
The pointwise training dataset comprises a total of 18,915 question-image pairs and 72,782 critic data samples.
The pairwise data collection resulted in a total of 40.1k pairwise data samples.
LLaVA-Critic-72B achieves an average Pearson correlation score of 0.754 in pointwise scoring, significantly outperforming the LLaVA-OV-72B baseline (0.634).
Measured by Kendall's Tau, LLaVA-Critic-72B achieves the highest average score of 0.933, again outperforming the LLaVA-OV-72B baseline (0.802).
LLaVA-Critic-72B achieves an average accuracy of 73.6% in pairwise comparisons without ties, outperforming both GPT-4o and GPT-4V.
LLaVA-Critic-72B achieves an accuracy of 60.5% for pairwise comparison with ties and a Kendall's Tau score of 0.779.
LLaVA-Critic-7B achieves an average accuracy of 59.6% in pairwise ranking with ties and 72.2% without ties, alongside a Kendall's Tau of 0.763.
LLaVA-Critic-72B achieves a Pearson similarity score of 0.393 in pointwise scoring on the MLLM-as-a-Judge benchmark.
For pairwise comparisons on the MLLM-as-a-Judge benchmark, LLaVA-Critic-72B achieves accuracy rates of 57.8% and 71.5% with and without ties, respectively.
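The Pearson and Kendall's Tau figures above measure how closely the critic's scores and rankings agree with the reference judgments (from GPT-4o or human annotators). A minimal sketch of how such agreement statistics are computed, using made-up scores rather than the benchmark data, is:

```python
from scipy.stats import pearsonr, kendalltau

# Hypothetical scores for five responses; not taken from any benchmark.
critic_scores = [7, 5, 9, 3, 6]      # LLaVA-Critic pointwise scores
reference_scores = [8, 5, 9, 2, 7]   # reference (e.g., GPT-4o) scores

r, _ = pearsonr(critic_scores, reference_scores)       # linear agreement
tau, _ = kendalltau(critic_scores, reference_scores)   # rank agreement
print(f"Pearson r = {r:.3f}, Kendall's tau = {tau:.3f}")
```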
Quotes
"We introduce LLaVA-Critic, the first open-source large multimodal model (LMM) designed as a generalist evaluator to assess performance across a wide range of multimodal tasks."
"LLaVA-Critic is trained using a high-quality critic instruction-following dataset that incorporates diverse evaluation criteria and scenarios."
"Our experiments demonstrate the model’s effectiveness in two key areas: (i) LMM-as-a-Judge, where LLaVA-Critic provides reliable evaluation scores, performing on par with or surpassing GPT models on multiple evaluation benchmarks; and (ii) Preference Learning, where it generates reward signals for preference learning, enhancing model alignment capabilities."
"This work underscores the potential of open-source LMMs in self-critique and evaluation, setting the stage for future research into scalable, superhuman alignment feedback mechanisms for LMMs."