EarthGPT is a universal multi-modal large language model (MLLM) for remote sensing (RS) image comprehension. It unifies a range of RS interpretation tasks in a single model, including scene classification, image captioning, visual question answering, and object detection. The authors propose a visual-enhanced perception mechanism that refines and incorporates semantic information at different scales, and a cross-modal mutual comprehension approach that deepens the model's understanding of both visual and language content. They also present a unified instruction tuning method for multi-sensor tasks in the RS domain. To address the lack of RS expertise in existing MLLMs, they construct the MMRS-1M dataset. Extensive experiments show that EarthGPT outperforms both specialist models and other MLLMs across various RS tasks.
Key ideas extracted from
by Wei Zhang, Mi... at arxiv.org, 03-11-2024
https://arxiv.org/pdf/2401.16822.pdf
Deeper Inquiries