
EXAMS-V: A Multilingual Multimodal Exam Benchmark for Vision Language Models


Core Concepts
EXAMS-V introduces a challenging multi-discipline, multimodal, multilingual exam benchmark for evaluating vision language models.
Abstract
EXAMS-V is a unique benchmark of 20,932 multiple-choice questions across 20 school disciplines in 11 languages. The questions feature diverse multimodal content, including text, images, tables, figures, and scientific symbols, and solving them requires advanced perception and joint reasoning over text and visual content. Unlike existing benchmarks, EXAMS-V is curated from school exam questions from countries with diverse education systems, challenging models with complex tasks that involve integrated visual elements. Evaluation results demonstrate the dataset's difficulty and its significance as a future benchmark.
Statistics
EXAMS-V consists of 20,932 multiple-choice questions across 20 subjects, in 11 languages from 7 language families. On the test set, GPT-4V achieved an overall average score of 42.78%, followed by Gemini-V with an overall average of 31.13%.
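The "overall average" figures above are aggregates over per-group scores. A minimal sketch of the kind of macro-averaging involved; the function name and the per-language numbers below are illustrative placeholders, not the paper's reported results:

```python
def macro_average(scores):
    """Return the unweighted mean of per-group accuracy scores (in %)."""
    if not scores:
        raise ValueError("no scores to average")
    return sum(scores.values()) / len(scores)

# Placeholder per-language accuracies, NOT the paper's numbers.
per_language = {"English": 50.0, "Chinese": 40.0, "German": 35.0}
overall = macro_average(per_language)
print(round(overall, 2))  # 41.67
```

An unweighted (macro) mean treats every language equally regardless of how many questions it contributes; a micro average over all questions would instead weight languages by question count.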
Quotes
"We introduce EXAMS-V, a new challenging multi-discipline multimodal multilingual exam benchmark for evaluating vision language models."

"Solving the problems in the dataset requires advanced perception and joint reasoning over the text and the visual content of the image."

"Our evaluation results demonstrate that this is a challenging dataset, which is difficult even for advanced vision–text models such as GPT-4V and Gemini."

Key Insights Distilled From

by Rocktim Jyot... arxiv.org 03-18-2024

https://arxiv.org/pdf/2403.10378.pdf
EXAMS-V

Deeper Inquiries

How does EXAMS-V address the limitations of existing benchmarks?

EXAMS-V addresses several limitations of existing benchmarks.

Firstly, it introduces a new benchmarking approach that requires models to reason over a unified snapshot containing both text and visual elements, unlike traditional benchmarks that keep text and images separate. This forces models into more sophisticated processing: distinguishing, preprocessing, and logically reasoning over combined textual and visual information.

Additionally, EXAMS-V has multilingual reach, covering 11 languages from 7 language families. This diversity increases the complexity and applicability of the dataset compared to primarily monolingual, English-focused benchmarks, and the inclusion of lower-resource languages such as Croatian, Serbian, and Italian further expands its linguistic scope.

Furthermore, by curating questions from high school exams across countries with diverse education systems, EXAMS-V demands intricate reasoning across diverse languages and relies on region-specific knowledge. This curation approach sets it apart from benchmarks that may not accurately assess model performance due to differences in examination methods or subject matter.
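Scoring a multiple-choice exam benchmark of this kind typically requires extracting a single option letter from a model's free-form reply before comparing it to the gold answer. A minimal sketch of such a parser; the regex, option set, and function name are assumptions for illustration, not the paper's actual evaluation protocol:

```python
import re

def extract_choice(reply, options="ABCD"):
    """Return the first standalone option letter found in the reply, or None.

    Naive: a stray capital letter (e.g. "A" starting a sentence) can be
    mistaken for an answer; real harnesses usually constrain the prompt
    or match stricter patterns like "Answer: B".
    """
    match = re.search(rf"\b([{options}])\b", reply)
    return match.group(1) if match else None

print(extract_choice("The correct answer is (B), because ..."))  # B
```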

How can the findings from EXAMS-V contribute to advancements in multilingual and multimodal models beyond vision language understanding?

The findings from EXAMS-V can contribute to advancements in multilingual and multimodal models beyond vision language understanding in several ways.

Model Performance Improvement: By evaluating state-of-the-art large language models (LLMs) and vision language models (VLMs) on a challenging dataset like EXAMS-V, researchers can identify areas for improvement in these models' capabilities on complex reasoning tasks involving multiple modalities.

Language Support Enhancement: The evaluation results provide insights into model performance across the languages in the dataset. Researchers can use this information to enhance language support in future iterations of LLMs/VLMs, or to develop specialized models for specific languages based on the performance metrics.

Multimodal Reasoning Development: The requirement for joint reasoning over text and visual content challenges current VLMs' multimodal reasoning abilities. The findings could lead to more robust multimodal architectures capable of handling integrated visual elements effectively.

Generalization Across Subjects: Since EXAMS-V covers a wide range of subjects spanning the natural sciences, social sciences, and arts, among others, the findings could help improve the generalization of LLMs/VLMs across diverse disciplines, beyond vision-language tasks alone.