Core Concepts
The authors propose a Dual-Level Alignment (DELAN) framework that uses cross-modal contrastive learning to align navigation-related modalities (instruction, observation, and navigation history) before the fusion stage, with the aim of improving cross-modal interaction and action decision-making.
Abstract
The paper addresses the challenge of aligning navigation modalities in Vision-and-Language Navigation (VLN) tasks. Existing VLN models rely mainly on cross-modal attention at the fusion stage, but the features produced by separate unimodal encoders reside in their own feature spaces, which degrades the quality of cross-modal fusion and action decisions.
To address this problem, the authors propose the DELAN framework, which aligns the navigation-related modalities before the fusion stage using cross-modal contrastive learning. Specifically, they divide the pre-fusion alignment into two levels, an instruction-history level and a landmark-observation level, according to the semantic correlations between the modalities.
For the instruction-history level alignment, the authors employ contrastive loss on the history tokens and the instruction part of a dual-level instruction across both global and local representations. For the landmark-observation level alignment, they use contrastive loss on the observations and the landmark part of the dual-level instruction at each time step across only local representations.
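The idea of pulling paired features together with a contrastive loss can be illustrated with a small sketch. This is a generic symmetric InfoNCE-style loss over pooled (global) feature vectors, not the authors' implementation: the function names, the use of cosine similarity, and the temperature of 0.1 are all assumptions made for illustration, and the local (token-level) alignment is not shown.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(queries, keys, temperature=0.1):
    """Mean cross-entropy of matching queries[i] to keys[i] against all keys.

    keys[i] is the positive for queries[i]; every other key in the batch
    acts as a negative. The temperature value is an assumed hyperparameter.
    """
    total = 0.0
    for i, query in enumerate(queries):
        logits = [cosine_similarity(query, key) / temperature for key in keys]
        # Numerically stable log-sum-exp over the logits.
        max_logit = max(logits)
        log_sum_exp = max_logit + math.log(
            sum(math.exp(logit - max_logit) for logit in logits)
        )
        total += log_sum_exp - logits[i]
    return total / len(queries)


def symmetric_contrastive_loss(a_feats, b_feats, temperature=0.1):
    """Average of the a-to-b and b-to-a InfoNCE terms."""
    return 0.5 * (
        info_nce(a_feats, b_feats, temperature)
        + info_nce(b_feats, a_feats, temperature)
    )


# Toy batch: two pooled instruction vectors and two pooled history vectors
# (hypothetical values, chosen so that matching indices point the same way).
instruction_feats = [[1.0, 0.0], [0.0, 1.0]]
history_feats = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = symmetric_contrastive_loss(instruction_feats, history_feats)
shuffled_loss = symmetric_contrastive_loss(instruction_feats, history_feats[::-1])
```

In this toy batch the correctly paired features give a much lower loss than the shuffled pairing, which is the signal a pre-fusion alignment objective of this kind optimizes.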
The authors validate their framework across various VLN benchmarks, including R2R, R4R, RxR, and CVDN, demonstrating the effectiveness and consistency of the DELAN framework in improving navigation performance.
Stats
The authors report the following key metrics:
On the R2R dataset, DELAN achieves 62.69% SPL (+1.7%) on the test split.
On the RxR dataset, DELAN improves on the baselines on the test split (+1.1% on SPL and +1.0% on SR).
On the R4R dataset, DELAN consistently outperforms the baseline HAMT (+0.9% on SR), especially on path-fidelity metrics (+3.3% on CLS, +3.7% on nDTW, and +2.6% on sDTW).
On the CVDN dataset, DELAN improves the goal progress of previous models, raising HAMT's by 0.27 meters and DUET's by 0.23 meters in the test environments.
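For readers unfamiliar with the SR and SPL figures above, the sketch below computes both, assuming the standard definitions: SR is the fraction of successful episodes, and SPL (Success weighted by Path Length) scales each success by the ratio of the shortest-path length to the length of the path actually taken. The episode tuples and function names are hypothetical and the numbers are made up for illustration; they are not results from the paper.

```python
def success_rate(episodes):
    """Fraction of episodes flagged as successful.

    Each episode is (success, shortest_path_length, agent_path_length).
    """
    return sum(1 for success, _, _ in episodes if success) / len(episodes)


def spl(episodes):
    """Success weighted by Path Length, assuming positive shortest-path lengths."""
    total = 0.0
    for success, shortest, taken in episodes:
        if success:
            # A successful episode earns full credit only if the agent's
            # path is no longer than the shortest path.
            total += shortest / max(taken, shortest)
    return total / len(episodes)


# Illustrative episodes: an optimal success, a success with a detour twice
# the shortest length, and a failure.
example_episodes = [
    (True, 10.0, 10.0),
    (True, 10.0, 20.0),
    (False, 10.0, 5.0),
]

example_sr = success_rate(example_episodes)   # 2 of 3 episodes succeed
example_spl = spl(example_episodes)           # (1.0 + 0.5 + 0.0) / 3
```

Because inefficient successes are discounted, SPL never exceeds SR, so a gain in SPL reflects more successful episodes, shorter paths, or both.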