The FoodieQA dataset is designed to evaluate the fine-grained understanding of Chinese food culture through multiple-choice questions based on visual and textual information. The dataset covers 14 distinct cuisine types across China, reflecting the regional diversity in the country.
The dataset consists of three tasks:
- Multi-image VQA: selecting the dish that matches a question from several candidate images.
- Single-image VQA: answering a question about a single dish image.
- Text-based QA: answering questions about dishes from their names alone, without any image.
The authors collected a set of non-public images uploaded by local Chinese people to ensure the images are not present in the pretraining data of existing models. They then had native Chinese annotators create the multiple-choice questions and answers, covering a diverse set of question types.
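To make the multiple-choice format concrete, here is a minimal sketch of how one such item and its accuracy metric could be represented. The field names and the example dish are illustrative assumptions, not the dataset's actual schema:

```python
# Hypothetical FoodieQA-style multiple-choice item (field names are
# illustrative assumptions, not the dataset's real schema).
item = {
    "task": "single-image VQA",        # one of the three task types
    "cuisine": "Sichuan",              # one of the 14 regional cuisine types
    "image": "dishes/example_dish.jpg",  # locally sourced, non-public photo
    "question": "Which flavor profile best describes this dish?",
    "choices": ["sweet and sour", "numbing and spicy", "mild", "smoky"],
    "answer_idx": 1,                   # index of the correct choice
}

def accuracy(predictions, items):
    """Fraction of items where the predicted choice index matches the answer."""
    correct = sum(p == it["answer_idx"] for p, it in zip(predictions, items))
    return correct / len(items)

print(accuracy([1], [item]))  # a correct prediction yields 1.0
```

Model performance on each task is then simply this accuracy over the relevant subset of items, which is how human and model results can be compared on a common scale.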
Experiments with state-of-the-art language models and vision-language models reveal that understanding food culture and its regional variations remains a challenging task. While large language models excel at text-based question answering, open-weight vision-language models still fall short by a significant margin compared to human performance, especially on multi-image VQA tasks. The analysis also shows that visual information is crucial for models to correctly answer questions about food culture, and that models exhibit varying strengths in different aspects of food knowledge, such as cooking skills versus flavor profiles.
The FoodieQA dataset aims to advance the boundaries of fine-grained vision-language understanding in the context of food and culture, and the authors encourage the community to create similar datasets for other language and culture groups.
Key insights distilled from: arxiv.org, by Weny..., 10-01-2024
https://arxiv.org/pdf/2406.11030.pdf