Key Idea
VisualWebBench is a comprehensive multimodal benchmark designed to assess the capabilities of Multimodal Large Language Models (MLLMs) in the web domain, covering a variety of tasks such as captioning, webpage QA, OCR, grounding, and reasoning.
Abstract
VisualWebBench is a multimodal benchmark that aims to comprehensively evaluate the webpage understanding and grounding capabilities of Multimodal Large Language Models (MLLMs). It consists of seven tasks spanning three levels: website-level, element-level, and action-level.
The website-level tasks include:
- Captioning: Generating a meta description for a webpage screenshot.
- WebQA: Answering open-ended questions about the content and layout of a webpage.
The element-level tasks include:
- Heading OCR: Recognizing the text of a webpage's heading.
- Element OCR: Recognizing the text content of a lengthy webpage element.
- Element Grounding: Locating a specified webpage element in a screenshot.
The action-level tasks include:
- Action Prediction: Predicting the title of a new webpage after clicking on a specific element.
- Action Grounding: Determining the correct element to click to fulfill a given instruction.
VisualWebBench comprises 1.5K instances across 139 real websites, covering 12 different domains and 87 sub-domains. The benchmark is designed to be comprehensive, multi-granular, and high-quality, with careful human verification and curation.
The authors evaluate 14 open-source MLLMs, Gemini Pro, Claude Sonnet, Claude Opus, and GPT-4V(ision) on VisualWebBench. The results reveal significant challenges for current MLLMs, with a notable performance gap between open-source and proprietary models. The analysis also highlights the limitations of current MLLMs, including inadequate grounding in text-rich environments and subpar performance with low-resolution image inputs.
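The average scores cited for these models are presumably unweighted means over per-task scores. A minimal sketch of that aggregation, using made-up placeholder numbers (not results reported by the paper):

```python
# Sketch of macro-averaging per-task scores into a single benchmark score.
# The per-task numbers below are illustrative placeholders, not reported
# VisualWebBench results.
def average_score(task_scores: dict[str, float]) -> float:
    """Unweighted mean over the seven VisualWebBench tasks."""
    return sum(task_scores.values()) / len(task_scores)

illustrative = {
    "captioning": 70.0, "webqa": 60.0, "heading_ocr": 80.0,
    "element_ocr": 55.0, "element_grounding": 40.0,
    "action_prediction": 65.0, "action_grounding": 45.0,
}
print(round(average_score(illustrative), 1))
```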
VisualWebBench is expected to serve as a valuable resource for the research community, contributing to the development of more capable and efficient MLLMs for web-related applications.
Statistics
VisualWebBench comprises 1.5K instances across 139 real websites, covering 12 different domains and 87 sub-domains.
Quotes
"VisualWebBench presents significant challenges for current MLLMs, with GPT-4V and Claude Sonnet achieving average scores of 64.6 and 65.8, respectively, indicating substantial room for improvement."
"A notable performance gap exists between open-source MLLMs and proprietary counterparts such as GPT-4V and Claude series, with the leading open-source model, LLaVA-1.6-34B, achieving an average score of 50.5."
"Grounding ability, a crucial skill for developing MLLM-based web applications like autonomous web agents, is a weakness for most MLLMs."