Leveraging Large Language Models for Efficient Vision-Language Modeling in Remote Sensing without Human Annotations
This study introduces an approach to curating large-scale vision-language datasets for the remote sensing domain by employing an image-decoding machine learning model, thereby eliminating the need for human-annotated labels. The resulting model, RSCLIP, outperforms counterparts that do not leverage publicly available vision-language datasets, particularly on downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval.
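To illustrate the annotation-free curation step described above, the sketch below generates pseudo-captions for unlabeled remote sensing images with an off-the-shelf captioning model, producing image-text pairs suitable for CLIP-style contrastive training. This is a minimal sketch under stated assumptions: BLIP is used here as a hypothetical stand-in for the paper's image-decoding model, and the checkpoint name, file layout, and output format are illustrative rather than taken from the paper.

```python
# Minimal sketch: build image-text pairs from unlabeled remote sensing tiles
# using an off-the-shelf captioner (BLIP as an illustrative stand-in; the
# paper's actual image-decoding model and prompting may differ).
from pathlib import Path
import json

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_images(image_dir: str, out_path: str) -> None:
    """Write {image, text} records, forming a vision-language dataset
    without any human annotation."""
    records = []
    for img_path in sorted(Path(image_dir).glob("*.png")):
        image = Image.open(img_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(out[0], skip_special_tokens=True)
        records.append({"image": str(img_path), "text": caption})
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# Example (hypothetical paths):
# caption_images("remote_sensing_tiles/", "rs_image_text_pairs.json")
```

The resulting set of image-text pairs could then feed a standard CLIP-style contrastive pretraining loop of the kind used to obtain RSCLIP; the directory and file names above are purely illustrative.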