This paper introduces CORE-Bench, a benchmark designed to measure the ability of AI agents to perform computational reproducibility: reproducing the results of a scientific study using the code and data its authors provide. This step is fundamental to the scientific process but often difficult in practice.
The CORE-Bench benchmark consists of 270 tasks based on 90 scientific papers across three disciplines: computer science, social science, and medicine. The tasks are divided into three difficulty levels, with varying amounts of information provided to the agent. The benchmark evaluates diverse skills such as coding, shell interaction, retrieval, and tool use.
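To make that structure concrete, the following is a minimal sketch of how one such task might be represented and scored. It is not the benchmark's actual schema: the field names, the Difficulty enum, the capsule_url attribute, and the all-questions-correct scoring rule in the score helper are illustrative assumptions, not details taken from the paper.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Dict


class Difficulty(Enum):
    # Three levels, each giving the agent progressively less information
    # about how to reproduce the paper's results (names are illustrative).
    EASY = "easy"
    MEDIUM = "medium"
    HARD = "hard"


@dataclass
class ReproductionTask:
    """One of the 270 tasks: reproduce specific results from one paper."""
    paper_id: str              # identifier of one of the 90 underlying papers
    discipline: str            # "computer science", "social science", or "medicine"
    difficulty: Difficulty     # controls how much information the agent receives
    capsule_url: str           # hypothetical pointer to the paper's code and data
    questions: Dict[str, str]  # question about a reported result -> expected answer


def score(task: ReproductionTask, answers: Dict[str, str]) -> bool:
    """Assumed scoring rule: the task counts as solved only if every
    question about the paper's results is answered correctly."""
    return all(answers.get(q) == expected for q, expected in task.questions.items())
```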
The authors evaluated two baseline agents on CORE-Bench: the general-purpose AutoGPT and a task-specific agent called CORE-Agent. The results show that while automating computational reproducibility is challenging, task-specific modifications to a generalist agent can significantly improve performance, especially for weaker language models. The best agent achieved an accuracy of 21% on the hardest level of tasks, indicating substantial room for improvement.
The authors highlight the importance of computational reproducibility as a necessary step towards building agents that can conduct novel research. They hope that CORE-Bench can spur the development of future research agents and improve the state of reproducibility in scientific research.
Key insights distilled from the paper by Zachary S. S... at arxiv.org, 09-18-2024: https://arxiv.org/pdf/2409.11363.pdf