Core Concept
EarthGPT is a multi-modal large language model (MLLM) designed to unify a range of remote sensing (RS) interpretation tasks in a single model. It is reported to outperform both specialist models and other MLLMs on RS visual interpretation tasks.
Summary
EarthGPT, a versatile multi-modal large language model, integrates various RS interpretation tasks through visual-enhanced perception, cross-modal mutual comprehension, and unified instruction tuning. Extensive experiments demonstrate its superior performance in scene classification, image captioning, visual question answering (VQA), visual grounding, and object detection. The MMRS-1M dataset supports the development of MLLMs in the RS domain by providing diverse image-text pairs covering the optical, synthetic aperture radar (SAR), and infrared modalities.
Statistics
The MMRS-1M dataset contains over 1M image-text pairs built from 34 existing RS datasets.
EarthGPT achieves 77.37% accuracy in zero-shot scene classification on the CLRS dataset.
EarthGPT surpasses specialist models with a top-1 accuracy of 93.84% on the NWPU-RESISC45 dataset.
Quotes
"EarthGPT offers a versatile paradigm for open-set reasoning tasks."
"Extensive experiments demonstrate EarthGPT’s superior performance in a wide range of RS multi-sensor image comprehension tasks."