The paper introduces WebArena, a standalone and self-hostable web environment for building and evaluating autonomous agents. WebArena comprises four fully functional websites representing popular online domains: e-commerce, social forums, collaborative software development, and content management systems. The environment is enriched with utility tools (e.g., map, calculator) and external knowledge bases to emulate human-like task-solving.
The authors also release a benchmark of 812 long-horizon web-based tasks, where each task is described as a high-level natural language intent. The benchmark focuses on evaluating the functional correctness of task completions, which is more reliable than comparing textual surface-form action sequences.
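To make that contrast concrete, the following is a minimal Python sketch of functional-correctness evaluation versus surface-form action matching. The `Task` structure, field names, and example state are illustrative assumptions for exposition, not WebArena's actual evaluation code.

```python
# Illustrative sketch only: the Task structure, field names, and state dictionary
# below are assumptions for exposition, not WebArena's actual evaluation API.

from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Task:
    intent: str                                 # high-level natural language goal
    check_final_state: Callable[[dict], bool]   # programmatic check on the resulting state


def functional_success(task: Task, final_state: dict) -> bool:
    """Count the task as solved if the resulting state satisfies the task's check,
    regardless of which action sequence produced it."""
    return task.check_final_state(final_state)


def surface_form_match(predicted: List[str], reference: List[str]) -> bool:
    """The brittle alternative: exact match against a single reference action
    sequence, which penalizes valid but differently ordered solutions."""
    return predicted == reference


if __name__ == "__main__":
    task = Task(
        intent="Create a private repository named 'demo'",
        check_final_state=lambda s: s.get("name") == "demo" and s.get("visibility") == "private",
    )

    # An agent that reaches the correct end state passes the functional check...
    print(functional_success(task, {"name": "demo", "visibility": "private"}))  # True

    # ...even though its action sequence differs from the reference trajectory.
    agent_actions = ["click New Project", "type demo", "select private", "click Create"]
    reference_actions = ["click New Project", "select private", "type demo", "click Create"]
    print(surface_form_match(agent_actions, reference_actions))  # False
```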
Experiments with several baseline agents, including GPT-4-based models, show that solving complex tasks in WebArena is challenging. The best GPT-4 agent achieves an end-to-end task success rate of only 14.41%, far below the human performance of 78.24%. These results highlight the need for more robust agents and show that current state-of-the-art large language models are still far from solving these real-life tasks reliably.