The Heron-Bench is a new benchmark introduced in this work for evaluating the Japanese language capabilities of Vision Language Models (VLMs). It consists of 102 image-question-answer pairs covering a variety of topics relevant to the Japanese context, including anime, art, culture, food, landscape, landmarks, and transportation.
The benchmark was constructed by first collecting 21 public domain or CC BY 2.0 licensed images related to Japan. For each image, the authors set up three categories of questions: Conversation, Detail, and Complex. They then manually wrote a detailed description of each image to serve as context, and used the GPT-4 API to generate model answers to the questions based on that context.
To evaluate a VLM, the images and questions are input into the model, and the generated answers are rated by the GPT-4 API based on how well they match the context. The final score is the ratio of the average rating of the VLM's answers to the average rating of the GPT-4 model answers.
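The relative scoring described above can be sketched as follows. This is a minimal illustration, not the authors' evaluation code; the per-question ratings and the 1–10 rating scale are hypothetical placeholders for scores assigned by the GPT-4 judge.

```python
from statistics import mean

def relative_score(vlm_ratings, reference_ratings):
    """Return the benchmark score as a percentage: the VLM's average
    judge rating divided by the average rating of the GPT-4
    reference answers."""
    return 100 * mean(vlm_ratings) / mean(reference_ratings)

# Hypothetical per-question ratings from the GPT-4 judge
vlm = [7, 8, 6, 9]        # ratings of the evaluated VLM's answers
ref = [9, 10, 8, 10]      # ratings of the GPT-4 reference answers

print(round(relative_score(vlm, ref), 1))  # → 81.1
```

Scoring relative to GPT-4's own reference answers normalizes away question difficulty: a hard question that even GPT-4 answers imperfectly does not unfairly penalize the evaluated model.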
The authors also introduce a baseline Japanese VLM called Heron GIT, which is trained using the visual instruction tuning technique and Japanese image-text pairs. The performance of Heron GIT and other open and closed VLMs is evaluated on the Heron-Bench, as well as on Japanese-translated versions of the LLaVA-Bench (COCO) and LLaVA-Bench (In-the-Wild) datasets.
The results show that closed models like GPT-4V consistently achieve high scores, while open models exhibit varying strengths and weaknesses across different subcategories. The Heron-Bench reveals the capability gap between strong closed models and the baseline Japanese VLM, providing valuable insights for future research in this domain.
The authors release the Heron-Bench dataset and training code to facilitate further developments in Japanese VLM research.