The paper proposes S4, a novel self-supervised approach for semantic segmentation of satellite image time series (SITS). S4 exploits abundant unlabeled satellite data through two key insights:
Multi-Modal Imagery: Satellites capture images in different parts of the electromagnetic spectrum (e.g. RGB, radar). S4 uses these multi-modal images for cross-modal self-supervision.
Spatial Alignment and Geographic Location: Satellite images are geo-referenced, allowing for spatial alignment between data collected in different parts of the spectrum.
S4 leverages these unique properties of SITS through two main components:
Cross-Modal Reconstruction Network: S4 designs a cross-modal SITS reconstruction network that attempts to reconstruct imagery in one modality (e.g. radar) from the corresponding imagery in another modality (e.g. optical). This encourages the encoder networks to learn meaningful intermediate representations.
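To make the idea concrete, here is a minimal sketch of cross-modal reconstruction pretraining. The architecture, class names, channel counts, and the L1 objective are illustrative assumptions, not the paper's actual design:

```python
# Hedged sketch: encode optical frames, reconstruct the co-registered
# radar frames. All hyperparameters here are placeholder assumptions.
import torch
import torch.nn as nn

class CrossModalReconstructor(nn.Module):
    """Toy per-frame encoder/decoder for optical -> radar reconstruction."""
    def __init__(self, in_ch=3, out_ch=2, hidden=64):
        super().__init__()
        # Convolutional encoder applied to each time step independently.
        self.encoder = nn.Sequential(
            nn.Conv2d(in_ch, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Decoder maps the shared representation to the target modality.
        self.decoder = nn.Conv2d(hidden, out_ch, 3, padding=1)

    def forward(self, x):            # x: (B*T, C_optical, H, W)
        z = self.encoder(x)          # intermediate representation
        return self.decoder(z), z    # radar reconstruction + features

model = CrossModalReconstructor()
optical = torch.randn(8, 3, 64, 64)  # batch of optical frames
radar = torch.randn(8, 2, 64, 64)    # spatially aligned radar frames
recon, features = model(optical)
loss = nn.functional.l1_loss(recon, radar)  # reconstruction objective
loss.backward()
```

Because the two modalities are geo-referenced and spatially aligned, the reconstruction target is defined pixel-for-pixel, which is what makes this pretext task well-posed.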
MMST Contrastive Learning: S4 formulates a multi-modal, spatio-temporal (MMST) contrastive learning framework that aligns the intermediate representations of different modalities using a contrastive loss. This helps mitigate the impact of transient noise (such as cloud cover) that is visible in only one of the input modalities.
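Below is a hedged sketch of such a contrastive alignment step. The paper's exact positive/negative sampling is not reproduced; this symmetric InfoNCE variant simply treats co-located optical/radar embeddings as positives and all other samples in the batch as negatives:

```python
# Hedged sketch: align optical and radar embeddings with a symmetric
# InfoNCE loss. Function name and temperature are illustrative assumptions.
import torch
import torch.nn.functional as F

def mmst_contrastive_loss(z_opt, z_rad, temperature=0.1):
    """z_opt, z_rad: (N, D) embeddings of aligned optical/radar samples."""
    z_opt = F.normalize(z_opt, dim=1)
    z_rad = F.normalize(z_rad, dim=1)
    logits = z_opt @ z_rad.t() / temperature  # (N, N) cosine similarities
    targets = torch.arange(z_opt.size(0))     # positives on the diagonal
    # Symmetric cross-entropy pulls matched cross-modal pairs together
    # and pushes apart embeddings from different locations/times.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

loss = mmst_contrastive_loss(torch.randn(16, 128), torch.randn(16, 128))
```

Intuitively, if a cloud corrupts the optical view, the aligned radar view of the same location still anchors the representation, so the learned features become robust to single-modality noise.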
S4 supports single-modality inference, which is crucial in real-world settings where multi-modal data may not be available at test time. Experiments on two satellite image datasets demonstrate that S4 outperforms competing self-supervised baselines for segmentation, especially when labeled data is limited.