Core Concepts
The EVJVQA dataset provides a challenging benchmark for evaluating multilingual visual question answering (VQA) systems, covering three languages (Vietnamese, English, and Japanese) on images taken in Vietnam. The dataset aims to motivate research on cross-lingual AI models that can understand visual content and answer questions in diverse languages.
Abstract
The EVJVQA dataset was created to address the lack of multilingual visual question answering (VQA) resources, especially for languages beyond English. The dataset contains over 33,000 question-answer pairs in Vietnamese, English, and Japanese, covering approximately 5,000 images from Vietnam.
The dataset was constructed through a careful process of image collection, question-answer creation, and human translation. The images were selected to capture the diverse cultural and geographical context of Vietnam, going beyond the typical scenes found in existing VQA datasets. The questions and answers were first written in Vietnamese, then translated into English and Japanese by qualified crowd workers.
The EVJVQA dataset served as the benchmark for the VLSP 2022 EVJVQA Challenge, which attracted 62 participating teams from various universities and organizations. The challenge evaluated the performance of multilingual VQA systems using F1-score and BLEU as the evaluation metrics.
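For generative VQA, F1 is typically computed at the token level between a predicted answer and a reference answer. The sketch below shows one common formulation of token-level F1; the challenge's exact tokenization and normalization rules are not specified here, so this is an illustrative assumption, not the official scoring script.

```python
from collections import Counter

def token_f1(prediction: str, reference: str) -> float:
    """Token-level F1 between a predicted and a reference answer.

    Assumes simple lowercased whitespace tokenization; the official
    challenge evaluation may tokenize differently (especially for
    Japanese, which lacks whitespace word boundaries).
    """
    pred_tokens = prediction.lower().split()
    ref_tokens = reference.lower().split()
    if not pred_tokens or not ref_tokens:
        # Both empty counts as a match; otherwise no overlap is possible.
        return float(pred_tokens == ref_tokens)
    # Multiset intersection counts each shared token at most as often
    # as it appears in both strings.
    common = Counter(pred_tokens) & Counter(ref_tokens)
    overlap = sum(common.values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(ref_tokens)
    return 2 * precision * recall / (precision + recall)

# Example: two of three tokens overlap, so precision = recall = 2/3.
print(token_f1("a red motorbike", "the red motorbike"))
```

BLEU, the second metric, additionally rewards longer matching n-grams, so it penalizes answers that contain the right words in a scrambled order more heavily than token-level F1 does.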
The top-performing models utilized powerful pre-trained vision and language models, such as ViT and mT5, demonstrating the potential of transfer learning approaches for multilingual VQA. However, the highest F1-score on the private test set was only 0.4392, indicating the dataset's challenging nature and the need for further advancements in multilingual VQA research.
The EVJVQA dataset provides a valuable resource for the research community to explore and develop more effective multilingual VQA systems. It encourages the exploration of cross-lingual knowledge transfer, the incorporation of cultural and contextual understanding, and the advancement of multimodal AI models that can seamlessly handle diverse languages and visual content.
Stats
The average length of questions is 8.7 tokens in Vietnamese, 8.6 tokens in English, and 13.3 tokens in Japanese.
The average length of answers is 7.2 tokens in Vietnamese, 5.0 tokens in English, and 5.9 tokens in Japanese.
Quotes
"EVJVQA is a challenging dataset that motivates NLP and CV researchers to further explore the multilingual models or systems for visual question answering systems."
"The multilingual VQA systems proposed by the top 2 teams use ViT for the pre-trained vision model and mT5 for the pre-trained language model, a powerful pre-trained language model based on the transformer architecture."