Evaluating the Capabilities of Tool-Augmented Large Language Models in Remote Sensing Applications
Tool-augmented Large Language Models (LLMs) have shown promising capabilities in remote sensing applications, but existing benchmarks fail to capture the nuances of realistic, user-grounded tasks. The GeoLLM-QA benchmark aims to bridge this gap by evaluating LLM agents on a diverse set of 1,000 complex remote sensing tasks that require multimodal reasoning and interaction with a real user interface platform.