Core Concepts
A deep learning framework, GeoBind, that creates a joint embedding space aligning satellite imagery, ground-level images, audio, and text, using satellite images as the binding element.
Abstract
The paper presents GeoBind, a deep learning framework that creates a joint embedding space aligning multiple modalities: satellite imagery, ground-level images, audio, and text. The key idea is to use satellite imagery as the binding element and to contrastively align each of the other modalities to the satellite image embeddings.
The framework has two main stages:
In the first stage, the satellite image embeddings are aligned with the CLIP embeddings of the corresponding ground-level images. This automatically aligns the satellite image embeddings with the textual descriptions of the ground-level scenery, as the CLIP space already aligns semantically related text and images.
In the second stage, the audio embeddings are contrastively aligned with the satellite image embeddings from the first stage. This results in a final joint embedding space where semantically related satellite imagery, ground-level imagery, audio, and text are pushed together.
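To make the two stages concrete, here is a minimal PyTorch sketch assuming a standard symmetric InfoNCE (CLIP-style) contrastive loss. The encoder interfaces, function names, and temperature value are illustrative assumptions, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def clip_style_loss(emb_a, emb_b, temperature=0.07):
    """Symmetric InfoNCE loss between two batches of paired embeddings."""
    emb_a = F.normalize(emb_a, dim=-1)
    emb_b = F.normalize(emb_b, dim=-1)
    logits = emb_a @ emb_b.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(emb_a.size(0), device=emb_a.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2

# Stage 1: train a satellite-image encoder to match the frozen CLIP
# embeddings of paired ground-level images. Because CLIP already aligns
# images with text, this pulls text into the same space for free.
def stage1_step(sat_encoder, clip_image_encoder, sat_imgs, ground_imgs):
    with torch.no_grad():
        ground_emb = clip_image_encoder(ground_imgs)  # frozen CLIP space
    sat_emb = sat_encoder(sat_imgs)
    return clip_style_loss(sat_emb, ground_emb)

# Stage 2: freeze the stage-1 satellite encoder and align a new audio
# encoder to the satellite embeddings of the paired locations.
def stage2_step(sat_encoder, audio_encoder, sat_imgs, audio_clips):
    with torch.no_grad():
        sat_emb = sat_encoder(sat_imgs)               # frozen after stage 1
    audio_emb = audio_encoder(audio_clips)
    return clip_style_loss(audio_emb, sat_emb)
```

The design choice worth noting is that each stage only ever needs pairwise data (satellite + one other modality), which is why no dataset containing all modalities at once is required.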
The authors show that this framework produces a versatile embedding space that can be used for various geospatial tasks involving different modalities, without requiring a single complex dataset that contains all modalities simultaneously. Retrieval experiments demonstrate that the model can reason about multiple modalities for a given satellite image input, unlike traditional unimodal models.
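To illustrate what such cross-modal retrieval looks like, here is a hypothetical sketch: candidates from any modality are ranked against a satellite-image query by cosine similarity in the shared space. The function name and tensor shapes are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def retrieve(query_emb, candidate_embs, k=5):
    """Return indices of the top-k candidates closest to the query
    by cosine similarity in the joint embedding space."""
    query = F.normalize(query_emb, dim=-1)        # (D,)
    cands = F.normalize(candidate_embs, dim=-1)   # (N, D)
    sims = cands @ query                          # (N,) cosine similarities
    return sims.topk(k).indices

# Because every modality lives in the same space, the same function can
# retrieve text, audio, or ground-level images for a satellite query.
```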
The authors also discuss the potential to add more modalities by introducing additional stages that align them with the satellite imagery, making the framework easily scalable. This work encourages the development of general-purpose deep learning models that can reason about multiple characteristics of a given satellite image, rather than relying on a large number of highly specific unimodal models.
Stats
Satellite images, acquired from Bing Maps, have a ground resolution of 0.6 m/px and a size of 800x800 px.
The dataset contains 6.1M pairs of overhead and ground-level images, and 50k geo-tagged audio samples paired with satellite images.
Quotes
"Our primary contribution is the introduction of a framework that allows us to create an embedding space with any "n" number of modalities by using satellite imagery to bind them."
"Adding a new modality to an existing embedding space only requires adding a new stage to the process, which makes the framework easily scalable."