EarthGPT, a versatile multi-modal large language model (MLLM), integrates various remote sensing (RS) interpretation tasks through visual-enhanced perception, cross-modal mutual comprehension, and unified instruction tuning. Extensive experiments demonstrate its superior performance in scene classification, image captioning, visual question answering (VQA), visual grounding, and object detection. The MMRS-1M dataset facilitates the development of MLLMs in the RS domain by providing diverse image-text pairs based on optical, synthetic aperture radar (SAR), and infrared modalities.
Key insights distilled from
by Wei Zhang, Mi... at arxiv.org, 03-11-2024
https://arxiv.org/pdf/2401.16822.pdf