Leveraging Large Language Models for Efficient Vision-Language Modeling in Remote Sensing without Human Annotations
This study introduces an approach to curating large-scale vision-language datasets for the remote sensing domain by employing an image-decoding machine learning model, thereby eliminating the need for human-annotated labels. The resulting model, RSCLIP, outperforms counterparts that do not leverage publicly available vision-language datasets, particularly on downstream tasks such as zero-shot classification, semantic localization, and image-text retrieval.
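To illustrate the annotation-free curation step described above, the sketch below generates pseudo-captions for unlabeled remote sensing images with an off-the-shelf captioning model, producing image-text pairs suitable for CLIP-style contrastive training. This is a minimal sketch under stated assumptions: BLIP is used here as a hypothetical stand-in for the paper's image-decoding model, and the checkpoint name, file layout, and output format are illustrative rather than taken from the paper.

```python
# Minimal sketch: build image-text pairs from unlabeled remote sensing tiles
# using an off-the-shelf captioner (BLIP as an illustrative stand-in; the
# paper's actual image-decoding model and prompting may differ).
from pathlib import Path
import json

from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

processor = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
model = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")

def caption_images(image_dir: str, out_path: str) -> None:
    """Write {image, text} records, forming a vision-language dataset
    without any human annotation."""
    records = []
    for img_path in sorted(Path(image_dir).glob("*.png")):
        image = Image.open(img_path).convert("RGB")
        inputs = processor(images=image, return_tensors="pt")
        out = model.generate(**inputs, max_new_tokens=40)
        caption = processor.decode(out[0], skip_special_tokens=True)
        records.append({"image": str(img_path), "text": caption})
    with open(out_path, "w") as f:
        json.dump(records, f, indent=2)

# Example (hypothetical paths):
# caption_images("remote_sensing_tiles/", "rs_image_text_pairs.json")
```

The resulting set of image-text pairs could then feed a standard CLIP-style contrastive pretraining loop of the kind used to obtain RSCLIP; the directory and file names above are purely illustrative.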