Core Concepts
The proposed "Harmonized Transfer Learning and Modality Alignment (HarMA)" method simultaneously satisfies task constraints, modality alignment, and single-modality uniform alignment, while minimizing training overhead through parameter-efficient fine-tuning.
Abstract
The paper addresses the challenges of applying the pretrain-then-fine-tune paradigm to remote sensing tasks. Specifically, embeddings from the same modality tend to cluster together, which impedes efficient transfer learning.
To tackle this issue, the authors revisit the goal of multimodal transfer learning for downstream tasks from a unified perspective, and rethink the optimization process in terms of three distinct objectives: task constraints, modality alignment, and single-modality uniform alignment.
The authors propose "Harmonized Transfer Learning and Modality Alignment (HarMA)", a method that simultaneously satisfies these three objectives while minimizing training overhead through parameter-efficient fine-tuning. HarMA employs a hierarchical multimodal adapter with mini-adapters, which mimics the human brain's strategy of utilizing shared mini-regions to process neural impulses from both visual and linguistic stimuli. It models the visual-language semantic space from low to high levels by hierarchically sharing multiple mini-adapters.
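To make the idea of hierarchically shared mini-adapters more concrete, here is a minimal sketch in plain Python. It is not the authors' implementation: the class names (MiniAdapter, SharedAdapterLayer, HierarchicalAdapter), the bottleneck design, the ReLU nonlinearity, and the residual connections are all illustrative assumptions, and the frozen pretrained backbone that the adapters would sit inside is omitted. The only point it illustrates is that, at each level, both modalities pass through one and the same shared mini-adapter.

```python
import random


def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector (list of floats)."""
    return [sum(w * v for w, v in zip(row, vector)) for row in matrix]


def make_matrix(rows, cols, rng, scale=0.1):
    """Small random weight matrix; the initialization scheme is an assumption."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class MiniAdapter:
    """Hypothetical bottleneck adapter: down-project, ReLU, up-project."""

    def __init__(self, dim, bottleneck, rng):
        self.down = make_matrix(bottleneck, dim, rng)
        self.up = make_matrix(dim, bottleneck, rng)

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.down, x)]
        return matvec(self.up, hidden)


class SharedAdapterLayer:
    """One level: a modality-specific adapter followed by a mini-adapter
    whose weights are shared by the image and text branches."""

    def __init__(self, dim, bottleneck, rng):
        self.image_adapter = MiniAdapter(dim, bottleneck, rng)
        self.text_adapter = MiniAdapter(dim, bottleneck, rng)
        self.shared = MiniAdapter(dim, bottleneck, rng)  # used by both modalities

    def forward(self, x, modality):
        own = self.image_adapter if modality == "image" else self.text_adapter
        h = [a + b for a, b in zip(x, own(x))]             # residual connection
        return [a + b for a, b in zip(h, self.shared(h))]  # shared path, residual


class HierarchicalAdapter:
    """Stack of levels, applied from low to high."""

    def __init__(self, dim, bottleneck, num_levels, seed=0):
        rng = random.Random(seed)
        self.layers = [SharedAdapterLayer(dim, bottleneck, rng) for _ in range(num_levels)]

    def forward(self, x, modality):
        for layer in self.layers:
            x = layer.forward(x, modality)
        return x


model = HierarchicalAdapter(dim=4, bottleneck=2, num_levels=3)
image_feature = model.forward([1.0, 0.5, -0.5, 2.0], "image")
text_feature = model.forward([0.2, -1.0, 0.3, 0.7], "text")
```

In a parameter-efficient setup of this kind, only the adapter weights would be trained while the backbone stays frozen, which is what keeps the number of updated parameters small.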
Additionally, the authors introduce a new objective function, the Adaptive Triplet Loss, to alleviate the severe clustering of features within the same modality. This loss function dynamically adjusts the focus between hard and easy samples, effectively aligning different modality samples at a fine-grained level while preventing over-aggregation among samples of the same modality.
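The summary does not give the formula for the Adaptive Triplet Loss, so the following is only a sketch of the general idea: a bidirectional triplet loss over image-text pairs in which each margin violation is re-weighted by its own size. The weighting function 1 - exp(-sharpness * violation), the function name adaptive_triplet_loss, and the margin and sharpness defaults are assumptions chosen for illustration, not the paper's definition.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def adaptive_triplet_loss(image_embs, text_embs, margin=0.2, sharpness=5.0):
    """Illustrative weighted triplet loss; image_embs[i] is paired with text_embs[i].

    Each violated triplet contributes weight * violation, where the weight
    grows with the violation, so harder negatives receive more focus.
    This weighting is an assumption, not the formula from the paper.
    """
    n = len(image_embs)
    sim = [[cosine(image_embs[i], text_embs[j]) for j in range(n)] for i in range(n)]
    total = 0.0
    for i in range(n):
        positive = sim[i][i]
        for j in range(n):
            if i == j:
                continue
            # Image i as anchor with text j as the negative.
            image_to_text = max(0.0, margin + sim[i][j] - positive)
            # Text i as anchor with image j as the negative.
            text_to_image = max(0.0, margin + sim[j][i] - positive)
            for violation in (image_to_text, text_to_image):
                weight = 1.0 - math.exp(-sharpness * violation)
                total += weight * violation
    return total / n


aligned_loss = adaptive_triplet_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped_loss = adaptive_triplet_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

With perfectly aligned pairs no triplet violates the margin and the loss is zero; swapping the captions makes every negative more similar than its positive, so the loss becomes positive.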
Experiments on two popular remote sensing multimodal retrieval benchmarks, RSICD and RSITMD, demonstrate that HarMA achieves state-of-the-art performance with minimal parameter updates, surpassing even fully fine-tuned models. The method's simplicity allows it to be easily integrated into almost all existing multimodal frameworks.
Stats
Three baseball fields are surrounded by green trees and two rows of red buildings.
The citys environment is good there are a lot of green plants.
A river with dark green water in the middle.
Quotes
"How can we model a highly aligned visual-language joint space while ensuring efficient transfer learning?"
"Inspired by this natural phenomenon, we propose 'Efficient Remote Sensing with Harmonized Transfer Learning and Modality Alignment (HarMA)'."
"Remarkably, without the need for external data for training, HarMA achieves state-of-the-art performance in two popular multimodal retrieval tasks in the field of remote sensing."