Bibliographic Information: Wang, J., Ma, Z., Li, Y., Zhang, S., Chen, C., Chen, K., & Le, X. (2024). GTA: A Benchmark for General Tool Agents. Advances in Neural Information Processing Systems, 37.
Research Objective: This paper introduces GTA, a novel benchmark designed to evaluate the tool-use capabilities of large language models (LLMs) in real-world scenarios, addressing the limitations of existing benchmarks that rely on artificial or overly simplistic tasks.
Methodology: The researchers developed GTA with a focus on three key aspects: 1) Real user queries: Human-designed queries reflecting real-world tasks with implicit tool-use requirements, demanding reasoning and planning from LLMs. 2) Real deployed tools: An evaluation platform equipped with executable tools across perception, operation, logic, and creativity categories, enabling assessment of actual task execution. 3) Real multimodal inputs: Authentic image files, such as spatial scenes and web page screenshots, providing context for the queries and aligning with real-world scenarios. The benchmark comprises 229 real-world tasks with corresponding executable tool chains, allowing for fine-grained evaluation of LLM performance.
Key Findings: Evaluation of 16 mainstream LLMs, including GPT-4 and open-source models, showed that existing LLMs struggle with real-world tool-use tasks. GPT-4 completed fewer than 50% of the tasks, while most other models completed fewer than 25%. The study identified argument prediction as a significant bottleneck: models often choose a suitable tool but fail to identify or format the arguments needed to invoke it.
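To make the argument-prediction bottleneck concrete, the following is a minimal sketch of how tool selection and argument prediction could be scored separately at the step level. It is not the authors' evaluation code: the `ToolCall` class, the `step_scores` function, the metric names `tool_acc` and `arg_acc`, and the example tool names and arguments are all hypothetical, and exact-match comparison of arguments is an assumption chosen for simplicity.

```python
from dataclasses import dataclass, field


@dataclass
class ToolCall:
    """One step in a tool chain: a tool name plus its keyword arguments.

    Hypothetical structure for illustration; not the benchmark's own schema.
    """
    tool: str
    args: dict = field(default_factory=dict)


def step_scores(predicted: list[ToolCall], reference: list[ToolCall]) -> dict:
    """Compare predicted and reference chains position by position.

    Returns the fraction of reference steps with the right tool, and the
    fraction with the right tool and exactly matching arguments.
    Missing predicted steps count as misses.
    """
    if not reference:
        raise ValueError("reference chain must not be empty")
    tool_hits = 0
    arg_hits = 0
    for pred, ref in zip(predicted, reference):
        if pred.tool == ref.tool:
            tool_hits += 1
            if pred.args == ref.args:
                arg_hits += 1
    n = len(reference)
    return {"tool_acc": tool_hits / n, "arg_acc": arg_hits / n}


# Hypothetical two-step task: read text from an image, then compute a total.
reference = [
    ToolCall("OCR", {"image": "menu.jpg"}),
    ToolCall("Calculator", {"expression": "12*3+8"}),
]
# The model picks both tools correctly but drops part of the expression.
predicted = [
    ToolCall("OCR", {"image": "menu.jpg"}),
    ToolCall("Calculator", {"expression": "12*3"}),
]
scores = step_scores(predicted, reference)
print(scores)  # {'tool_acc': 1.0, 'arg_acc': 0.5}
```

In this sketch the model selects the right tool at every step yet gets only half of the arguments right, which is the kind of gap the findings describe.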
Main Conclusions: GTA provides a challenging benchmark for evaluating and advancing the tool-use capabilities of LLMs in real-world scenarios. The authors emphasize the need for future research to focus on improving argument prediction and enhancing the reasoning and planning abilities of LLMs for effective tool integration.
Significance: This research contributes to LLM evaluation by introducing a more realistic and challenging benchmark that better reflects real-world tool-use scenarios. The findings expose current limitations and point to directions for future research on tool-augmented LLMs.
Limitations and Future Research: The benchmark currently lacks language diversity, focusing solely on English queries. Future work could expand GTA to include multilingual queries and explore methods for reducing the reliance on human-annotated data.
by Jize Wang, Z... at arxiv.org 11-25-2024
https://arxiv.org/pdf/2407.08713.pdf