The authors introduce GeoLLM-QA, a novel benchmark designed to assess the capabilities of tool-augmented Large Language Models (LLMs) in remote sensing applications. Unlike previous benchmarks that rely on predefined text-image templates, GeoLLM-QA captures the complexities of realistic user-grounded tasks where LLMs need to handle dynamic data structures, nuanced reasoning, and interactions with a real user interface platform.
The benchmark includes 1,000 diverse remote sensing tasks spanning a wide range of applications, such as object detection, change detection, and spatial analysis. The tasks are generated through a three-step process: reference-template collection, LLM-guided question generation, and human-guided ground-truth creation.
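The three-step pipeline can be pictured roughly as follows. This is a minimal sketch, not the authors' code: the template strings, the `llm_generate_question` stub, and the `human_ground_truth` stub are hypothetical stand-ins for the paper's actual LLM prompts and annotation workflow.

```python
from dataclasses import dataclass

@dataclass
class Task:
    question: str       # natural-language remote sensing request
    ground_truth: list  # expected sequence of (tool, arguments) calls

# Step 1: collect reference templates (illustrative examples only).
templates = [
    "Detect all {object} in {region}.",
    "Report changes in {region} between {date_a} and {date_b}.",
]

def llm_generate_question(template: str, slots: dict) -> str:
    # Step 2: LLM-guided question generation. A real implementation would
    # prompt an LLM to instantiate and diversify the template; here the
    # slots are filled deterministically.
    return template.format(**slots)

def human_ground_truth(question: str) -> list:
    # Step 3: human-guided ground-truth creation. In the benchmark, an
    # annotator records the correct tool-call sequence on the platform;
    # this stub returns a fixed example.
    return [("detect_objects", {"class": "ship", "region": "Port of Seattle"})]

slots = {"object": "ships", "region": "the Port of Seattle",
         "date_a": "2020-01", "date_b": "2024-01"}
tasks = []
for t in templates:
    q = llm_generate_question(t, slots)
    tasks.append(Task(question=q, ground_truth=human_ground_truth(q)))

print(tasks[0].question)  # -> "Detect all ships in the Port of Seattle."
```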
To comprehensively evaluate the LLM agents, the authors adopt a set of metrics that go beyond traditional text-based scores: success rate, correctness ratio, ROUGE-L score, token cost, and detection recall. They evaluate several state-of-the-art tool-augmentation and prompting techniques, including Chain-of-Thought (CoT), Chameleon, and ReAct, on the GeoLLM-QA benchmark.
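To make the metric suite concrete, here is one way such scores could be computed over a set of agent runs. This is a sketch under an assumed data layout (the `calls`, `gt_calls`, and `tokens` fields are invented for illustration), and the paper's exact metric definitions may differ; ROUGE-L and detection recall are omitted since they require reference texts and detector outputs, respectively.

```python
def success_rate(runs: list[dict]) -> float:
    # A task "succeeds" here when the agent's tool-call sequence matches
    # the ground truth exactly (one plausible reading of success rate).
    return sum(r["calls"] == r["gt_calls"] for r in runs) / len(runs)

def correctness_ratio(runs: list[dict]) -> float:
    # Average fraction of ground-truth calls the agent issued, giving
    # partial credit to incomplete but on-track runs.
    per_task = [
        sum(c in r["calls"] for c in r["gt_calls"]) / len(r["gt_calls"])
        for r in runs
    ]
    return sum(per_task) / len(per_task)

def mean_token_cost(runs: list[dict]) -> float:
    # Average number of tokens consumed per task (a proxy for API cost).
    return sum(r["tokens"] for r in runs) / len(runs)

runs = [
    {"calls": ["zoom_to", "detect_objects"],
     "gt_calls": ["zoom_to", "detect_objects"], "tokens": 812},
    {"calls": ["detect_objects"],
     "gt_calls": ["zoom_to", "detect_objects"], "tokens": 501},
]
print(success_rate(runs))       # 0.5  (one exact match out of two)
print(correctness_ratio(runs))  # 0.75 (1.0 and 0.5 averaged)
print(mean_token_cost(runs))    # 656.5
```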
The results highlight the strengths and weaknesses of the different approaches, revealing that recent GPT-4 Turbo models exhibit impressive function-calling capabilities and that CoT and ReAct outperform Chameleon in both correctness and success rates. The analysis also uncovers common error types, such as "Missed Function" errors, which account for more than half of all errors across the different methods.
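The "Missed Function" category can be illustrated with a toy classifier over call traces. The label comes from the paper, but the matching logic and the complementary "Extra Function" label below are assumptions made for illustration.

```python
def classify_errors(agent_calls: list[str], gt_calls: list[str]) -> list[tuple]:
    # "Missed Function": a ground-truth call the agent never issued
    # (the paper's dominant error type). "Extra Function" is a
    # hypothetical complementary label, not a category from the paper.
    errors = [("Missed Function", c) for c in gt_calls if c not in agent_calls]
    errors += [("Extra Function", c) for c in agent_calls if c not in gt_calls]
    return errors

print(classify_errors(agent_calls=["detect_objects"],
                      gt_calls=["zoom_to", "detect_objects"]))
# -> [('Missed Function', 'zoom_to')]
```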
The authors emphasize the importance of a comprehensive evaluation approach, as traditional metrics like ROUGE-L may not accurately capture the performance of LLM agents in this domain. They also discuss potential future directions, such as incorporating open-source GPT-V models, exploring the interaction between agent errors and suboptimal detector performance, and leveraging engine-based benchmarking methodologies to minimize human-in-the-loop overhead.
Key takeaways from arxiv.org, by Simranjit Si..., 05-03-2024. Source: https://arxiv.org/pdf/2405.00709.pdf