EarthGPT is a universal multi-modal large language model (MLLM) developed for remote sensing (RS) image comprehension. It unifies a range of RS interpretation tasks, including scene classification, image captioning, visual question answering, and object detection, in a single model. The authors propose a visual-enhanced perception mechanism that refines and incorporates semantic information at different scales, and a cross-modal mutual comprehension approach that deepens the model's joint understanding of visual and language content. They also present a unified instruction tuning method for multi-sensor tasks in the RS domain. To address the lack of RS expertise in existing MLLMs, they construct the MMRS-1M dataset. Extensive experiments show that EarthGPT outperforms both specialist models and other MLLMs across various RS tasks.
Key Insights Distilled From
by Wei Zhang, Mi... at arxiv.org 03-11-2024
https://arxiv.org/pdf/2401.16822.pdf
Deeper Inquiries