Abstract: Recent chatbot advances have centered on raw-text interfaces, prompting the need to evaluate multi-modal models on web pages. TURKINGBENCH introduces a benchmark of 158 tasks built on natural HTML pages that challenges state-of-the-art models. Its evaluation framework assesses the performance of language, vision, and layout models.
Introduction: Progress in AI models has largely been confined to text-only interfaces, limiting their ability to operate on the web. TURKINGBENCH addresses this gap by providing diverse web-grounded tasks for evaluation.
Dataset Comparison: TURKINGBENCH stands out from existing benchmarks through task instructions interleaved within web pages and natural data sourced from crowdsourcing platforms.
Challenges and Evaluation: Tasks require multi-modal understanding, interactive actions on web pages, and handling of long contexts. Notable models such as GPT-4 show promising results but fall short of the benchmark's ceiling performance.
Evaluation Protocol: An evaluation framework facilitates model interaction with web tasks through an action library. Task splits enable measuring generalization to unseen instructions.
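To make the protocol concrete, here is a minimal sketch of what such an action library might look like. It assumes a dict-based page model, and the names (`Action`, `apply_action`, `set_text`, `toggle_checkbox`, `select_option`) are illustrative, not the benchmark's actual API:

```python
# Hypothetical sketch of an action library for web-grounded tasks.
# All names and the dict-based page model are assumptions for
# illustration, not the benchmark's real interface.
from dataclasses import dataclass


@dataclass
class Action:
    kind: str        # e.g. "set_text", "toggle_checkbox", "select_option"
    target: str      # id of the HTML input element to act on
    value: str = ""  # payload for text / select actions


def apply_action(page_state: dict, action: Action) -> dict:
    """Apply one action to a dict mirroring the page's input fields."""
    state = dict(page_state)
    if action.kind == "set_text":
        state[action.target] = action.value
    elif action.kind == "toggle_checkbox":
        state[action.target] = not state.get(action.target, False)
    elif action.kind == "select_option":
        state[action.target] = action.value
    else:
        raise ValueError(f"unknown action kind: {action.kind}")
    return state


# A model's predicted actions are replayed against the page, and the
# resulting field values can then be scored against reference answers.
page = {"answer_box": "", "agree_cb": False}
for a in [Action("set_text", "answer_box", "a cat"),
          Action("toggle_checkbox", "agree_cb")]:
    page = apply_action(page, a)
print(page)  # {'answer_box': 'a cat', 'agree_cb': True}
```

Under this framing, scoring reduces to comparing the final field values against reference annotations, which is one way the protocol could measure generalization across task splits.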
Empirical Results: GPT-4 performs well across different input modalities but shows room for improvement compared to the oracle baseline. Varying the number of demonstrations has minimal impact on model performance.
Conclusion: TURKINGBENCH aims to advance research on general-purpose web-based agents by providing a standardized evaluation platform for model development and assessment.