Core Concepts
UrbanCross is a framework that improves satellite image-text retrieval by using cross-domain adaptation to bridge the domain gap between diverse urban landscapes.
Abstract
The paper presents UrbanCross, a framework that enhances satellite image-text retrieval by addressing the significant domain gaps across diverse urban landscapes. The key highlights are:
- Data Augmentation:
  - UrbanCross combines a Large Multimodal Model (LMM) with geo-tags to enrich textual descriptions, and uses the Segment Anything Model (SAM) for precise visual segmentation, so that both modalities carry contextual and semantic detail.
  - These techniques yield higher-quality data representations, which improves the accuracy of multimodal fusion across images, texts, and segmented visual elements.
- Cross-Domain Adaptation:
  - UrbanCross introduces an Adaptive Curriculum-based Source Sampler and a Weighted Adversarial Cross-Domain Fine-tuning Module to enhance adaptability across various domains.
  - The curriculum-based sampler progressively integrates more challenging source samples, ensuring smooth adaptation to data distribution changes.
  - The adversarial fine-tuning module aligns source and target domain distributions, effectively bridging the domain gap.
- Extensive Experiments:
  - UrbanCross reports a 10% improvement in retrieval performance and an average 15% boost over methods that lack domain adaptation.
  - The framework retrieves efficiently and adapts well to new urban environments, which shows that it handles the differing data distributions across domains.
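The curriculum idea described under Cross-Domain Adaptation can be illustrated with a minimal sketch. It is not the paper's implementation: the function name `curriculum_sample`, the difficulty measure (cosine distance from a source embedding to the mean target embedding), and the linear schedule are all assumptions chosen for illustration. The sketch only shows the general pattern of admitting easy source samples first and harder ones in later stages.

```python
import numpy as np

def curriculum_sample(source_emb, target_emb, stage, num_stages):
    """Return indices of source samples admitted at a given curriculum stage.

    Assumption (not from the paper): a source sample is "easy" when its
    embedding is close, in cosine terms, to the mean target embedding.
    """
    if not 0 <= stage < num_stages:
        raise ValueError("stage must be in [0, num_stages)")

    # Normalise rows so that dot products are cosine similarities.
    src = source_emb / np.linalg.norm(source_emb, axis=1, keepdims=True)
    centroid = target_emb.mean(axis=0)
    centroid = centroid / np.linalg.norm(centroid)

    # Small difficulty = close to the target domain = easy.
    difficulty = 1.0 - src @ centroid
    order = np.argsort(difficulty)  # easiest first

    # Linearly grow the admitted fraction from 1/num_stages up to 1.0.
    fraction = (stage + 1) / num_stages
    keep = max(1, int(np.ceil(fraction * len(order))))
    return order[:keep]


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    source = rng.normal(size=(8, 4))   # hypothetical source-domain embeddings
    target = rng.normal(size=(5, 4))   # hypothetical target-domain embeddings
    for stage in range(4):
        idx = curriculum_sample(source, target, stage, num_stages=4)
        print(stage, len(idx))
```

Because the samples are ranked once and each stage takes a longer prefix of that ranking, every stage contains all samples from the previous one, so the training distribution shifts gradually rather than abruptly.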
Stats
The satellite images have a Ground Sample Distance (GSD) of 0.1 to 0.5 m/pixel and cover diverse urban areas in Spain, Germany, and Finland.
The Spain, Germany, and Finland datasets contain 46,041, 165,217, and 59,781 image-text pairs, respectively, with 1,621, 5,826, and 3,033 types of geo-tags.
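To put these figures in proportion, the short sketch below totals the image-text pairs and converts GSD into ground distance. The per-country mapping follows the "respectively" ordering in the Stats; the 512-pixel tile width is a hypothetical value chosen for illustration, not a figure from the paper.

```python
# Counts as listed in the Stats, mapped to countries in the stated order.
stats = {
    "Spain": {"pairs": 46_041, "geo_tags": 1_621},
    "Germany": {"pairs": 165_217, "geo_tags": 5_826},
    "Finland": {"pairs": 59_781, "geo_tags": 3_033},
}

total_pairs = sum(s["pairs"] for s in stats.values())

def share(country):
    """Fraction of all image-text pairs contributed by one country."""
    return stats[country]["pairs"] / total_pairs

def ground_extent_m(pixels, gsd_m_per_pixel):
    """Ground distance in metres covered by `pixels` pixels at a given GSD."""
    return pixels * gsd_m_per_pixel

if __name__ == "__main__":
    print(f"total pairs: {total_pairs}")  # 271039
    for country in stats:
        print(f"{country}: {share(country):.1%} of pairs")
    # Hypothetical 512-pixel tile at the two ends of the GSD range.
    for gsd in (0.1, 0.5):
        print(f"GSD {gsd} m/pixel -> {ground_extent_m(512, gsd):.1f} m across")
```

Germany contributes the majority of the pairs, so the three domains are noticeably imbalanced in size.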
Quotes
"Enriched with geographic details, satellite imagery serves as a vital resource for comprehending the functionality of a region, with a variety of applications ranging from poverty assessment, crop yield prediction, to urban region profiling."
"This underscores the critical need for cross-domain adaptation to ensure semantically equivalent feature alignment across geographies."