Key Concepts
This study introduces an approach to curating large-scale vision-language datasets for the remote sensing domain by employing an image-to-text (captioning) model, eliminating the need for human-annotated labels. The resulting model, RSCLIP, outperforms counterparts that do not leverage publicly available vision-language datasets, particularly in downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval.
Summary
This paper presents a methodology for building a robust vision-language dataset tailored to the remote sensing domain. The authors use the InstructBLIP model to generate captions for remote sensing images sourced from a range of established datasets, pairing each image with machine-generated text. This process yields approximately 9.6 million vision-language pairs.
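The caption-extraction step can be sketched with the Hugging Face transformers implementation of InstructBLIP; the checkpoint name, prompt wording, and decoding settings below are illustrative assumptions rather than the paper's exact configuration.

```python
# Sketch: generate a caption for one remote sensing image with InstructBLIP.
# Checkpoint, prompt, and decoding parameters are assumptions, not the
# authors' exact setup.
from PIL import Image
from transformers import InstructBlipProcessor, InstructBlipForConditionalGeneration

MODEL_NAME = "Salesforce/instructblip-vicuna-7b"  # assumed checkpoint
processor = InstructBlipProcessor.from_pretrained(MODEL_NAME)
model = InstructBlipForConditionalGeneration.from_pretrained(MODEL_NAME)

def caption_image(path: str,
                  prompt: str = "Describe this remote sensing image in detail.") -> str:
    """Return a single machine-generated caption for the image at `path`."""
    image = Image.open(path).convert("RGB")
    inputs = processor(images=image, text=prompt, return_tensors="pt")
    output_ids = model.generate(**inputs, max_new_tokens=60, num_beams=5)
    return processor.batch_decode(output_ids, skip_special_tokens=True)[0].strip()

# Pairing every image with its generated caption produces the (image, text)
# training pairs without any human annotation.
```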
Building on this curated dataset, the authors introduce RSCLIP, a vision-language foundation model trained within the well-established CLIP framework. RSCLIP outperforms models that do not leverage publicly available vision-language datasets, particularly in downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval.
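For reference, the CLIP framework trains the image and text encoders with a symmetric image-text contrastive (InfoNCE) objective; a minimal sketch follows, with the temperature value as a placeholder rather than the paper's setting.

```python
# Sketch of the symmetric contrastive loss used in CLIP-style training.
import torch
import torch.nn.functional as F

def clip_contrastive_loss(image_embeds: torch.Tensor,
                          text_embeds: torch.Tensor,
                          temperature: float = 0.07) -> torch.Tensor:
    """Matched image/text rows are positives; all other pairs in the batch are negatives."""
    image_embeds = F.normalize(image_embeds, dim=-1)
    text_embeds = F.normalize(text_embeds, dim=-1)
    logits = image_embeds @ text_embeds.t() / temperature   # (N, N) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)             # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)         # text -> image direction
    return (loss_i2t + loss_t2i) / 2
```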
The authors also compare RSCLIP with models that directly use vision-language pairs from the downstream-task datasets. While those models outperform RSCLIP on some tasks, RSCLIP remains highly competitive, especially in tasks that rely solely on the vision encoder, such as few-shot classification, linear probing, and k-NN classification.
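As an illustration of a vision-encoder-only evaluation, the sketch below runs k-NN classification on frozen image features; the feature arrays stand in for embeddings produced by RSCLIP's (or any CLIP-style) image encoder, and the choice of k is an arbitrary assumption.

```python
# Sketch: k-NN classification on frozen vision-encoder features.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

def knn_eval(train_feats: np.ndarray, train_labels: np.ndarray,
             test_feats: np.ndarray, test_labels: np.ndarray, k: int = 20) -> float:
    """Fit a cosine-similarity k-NN on L2-normalized features and return top-1 accuracy."""
    train_feats = train_feats / np.linalg.norm(train_feats, axis=1, keepdims=True)
    test_feats = test_feats / np.linalg.norm(test_feats, axis=1, keepdims=True)
    knn = KNeighborsClassifier(n_neighbors=k, metric="cosine")
    knn.fit(train_feats, train_labels)
    return float((knn.predict(test_feats) == test_labels).mean())
```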
The key highlights of this work are:
- Developing a methodology to create a large-scale vision-language dataset for the remote sensing domain using an image-to-text (captioning) model, without the need for human annotations.
- Introducing RSCLIP, a vision-language foundation model that outperforms counterparts in various downstream tasks, including zero-shot classification, semantic localization, and image-text retrieval.
- Demonstrating that RSCLIP remains competitive even against models that directly use vision-language pairs from downstream-task datasets, particularly in tasks relying solely on the vision encoder.
Statistics
The proposed approach generated approximately 9.6 million vision-language pairs.
RSCLIP outperformed counterparts in image-text retrieval, achieving the highest mean recall across the RSICD and RSITMD datasets (the mean-recall metric is sketched after these statistics).
In zero-shot classification, RSCLIP achieved the best top-1 accuracy on the AID and RESISC45 datasets, with an average accuracy of 72.20%.
For semantic localization on the AIR-SLT dataset, RSCLIP recorded the best performance in the comprehensive indicator Rmi.
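Mean recall for image-text retrieval is commonly computed as the average of Recall@1/5/10 over both retrieval directions; the sketch below assumes L2-normalized embeddings and a single ground-truth caption per image, which simplifies the multi-caption setup of RSICD and RSITMD.

```python
# Sketch: mean recall over image-to-text and text-to-image retrieval.
import numpy as np

def recall_at_k(similarity: np.ndarray, k: int) -> float:
    """Fraction of queries (rows) whose matching item (same index) appears in the top-k."""
    ranks = np.argsort(-similarity, axis=1)  # best match first
    hits = (ranks[:, :k] == np.arange(len(similarity))[:, None]).any(axis=1)
    return float(hits.mean())

def mean_recall(image_embeds: np.ndarray, text_embeds: np.ndarray) -> float:
    sim = image_embeds @ text_embeds.T                        # cosine similarity if inputs are normalized
    scores = [recall_at_k(sim, k) for k in (1, 5, 10)]        # image -> text
    scores += [recall_at_k(sim.T, k) for k in (1, 5, 10)]     # text -> image
    return float(np.mean(scores))
```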
Quotes
"The prominence of generalized foundation models in vision-language integration has witnessed a surge, given their multifarious applications."
"Addressing these challenges has led to an intensified focus on vision-language foundation models within the remote sensing community."
"Utilizing this methodology, we amassed approximately 9.6 million vision-language paired datasets in VHR imagery."