VisualWebBench is a multimodal benchmark that aims to comprehensively evaluate the web page understanding and grounding capabilities of Multimodal Large Language Models (MLLMs). It consists of seven tasks spanning three different levels: website-level, element-level, and action-level.
The website-level tasks include captioning (generating a description of the full webpage from its screenshot) and WebQA (answering questions about the page content).
The element-level tasks include heading OCR, element OCR, and element grounding (locating the element that matches a given description).
The action-level tasks include action prediction (predicting the result of clicking a given element) and action grounding (identifying which element to click to carry out a given instruction).
VisualWebBench comprises 1.5K instances across 139 real websites, covering 12 different domains and 87 sub-domains. The benchmark is designed to be comprehensive, multi-granular, and high-quality, with careful human verification and curation.
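Since the benchmark reports per-task scores grouped into the three levels above, a minimal sketch of that aggregation could look like the following. This is not the authors' released evaluation code; the task keys mirror the level descriptions above, and the score values are placeholders, not real results:

```python
from statistics import mean

# Hypothetical per-task accuracy scores (placeholder values, not actual results)
task_scores = {
    "captioning": 0.62, "webqa": 0.55,            # website-level
    "heading_ocr": 0.70, "element_ocr": 0.48,     # element-level
    "element_grounding": 0.30,
    "action_prediction": 0.41,                    # action-level
    "action_grounding": 0.28,
}

# Grouping of the seven tasks into the benchmark's three levels
levels = {
    "website": ["captioning", "webqa"],
    "element": ["heading_ocr", "element_ocr", "element_grounding"],
    "action": ["action_prediction", "action_grounding"],
}

def level_averages(scores, grouping):
    """Average the per-task scores within each level."""
    return {lvl: mean(scores[t] for t in tasks) for lvl, tasks in grouping.items()}

print(level_averages(task_scores, levels))
```

A harness like this makes it easy to compare models at each granularity rather than only on a single overall number.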
The authors evaluate 14 open-source MLLMs, Gemini Pro, Claude Sonnet, Claude Opus, and GPT-4V(ision) on VisualWebBench. The results reveal significant challenges for current MLLMs, with a notable performance gap between open-source and proprietary models. The analysis also highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs.
VisualWebBench is expected to serve as a valuable resource for the research community, contributing to the development of more capable and efficient MLLMs for web-related applications.
Key insights extracted from
by Junpeng Liu, ... at arxiv.org, 04-10-2024
https://arxiv.org/pdf/2404.05955.pdf