EarthGPT is a versatile multi-modal large language model (MLLM) that unifies a range of remote sensing (RS) interpretation tasks through visual-enhanced perception, cross-modal mutual comprehension, and unified instruction tuning. Extensive experiments demonstrate its superior performance in scene classification, image captioning, visual question answering (VQA), visual grounding, and object detection. The MMRS-1M dataset supports the development of MLLMs in the RS domain by providing diverse image-text pairs drawn from optical, synthetic aperture radar (SAR), and infrared modalities.
Key insights distilled from:
by Wei Zhang, Mi... on arxiv.org, 03-11-2024
https://arxiv.org/pdf/2401.16822.pdf