VisualWebBench: Evaluating Multimodal Large Language Models' Capabilities in Web Page Understanding and Grounding
VisualWebBench is a comprehensive multimodal benchmark designed to assess the capabilities of Multimodal Large Language Models (MLLMs) in the web domain. It covers a variety of tasks, including captioning, webpage QA, OCR, grounding, and reasoning.