Dual-Level Alignment for Vision-and-Language Navigation via Cross-Modal Contrastive Learning
The authors propose Dual-Level Alignment (DELAN), a framework that uses cross-modal contrastive learning to align navigation-related modalities (instruction, observation, and navigation history) before the fusion stage, with the aim of improving cross-modal interaction and action decision-making.
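
To make the idea of pre-fusion alignment concrete, the following is a minimal sketch of a symmetric InfoNCE-style contrastive loss between two batches of paired embeddings. It is not the authors' implementation. The function name `info_nce`, the temperature value, the toy vectors, and the choice to pair instruction embeddings with navigation-history embeddings are all assumptions made for illustration. The summary above states only that cross-modal contrastive learning is applied before fusion.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def info_nce(anchors, positives, temperature=0.1):
    """Symmetric InfoNCE loss (hypothetical helper, not from the paper).

    anchors[i] and positives[i] form a matched pair; every other
    combination in the batch is treated as a negative.
    """
    a = [l2_normalize(v) for v in anchors]
    p = [l2_normalize(v) for v in positives]
    n = len(a)
    # sim[i][j]: temperature-scaled cosine similarity of anchor i and positive j.
    sim = [
        [sum(x * y for x, y in zip(a[i], p[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Anchor-to-positive direction: the correct "class" for row i is column i.
    a_to_p = -sum(log_softmax(sim[i])[i] for i in range(n)) / n
    # Positive-to-anchor direction uses the transposed similarity matrix.
    sim_t = [list(col) for col in zip(*sim)]
    p_to_a = -sum(log_softmax(sim_t[i])[i] for i in range(n)) / n
    return 0.5 * (a_to_p + p_to_a)


# Toy embeddings (assumed values): two instructions and their matching histories.
instruction_emb = [[1.0, 0.0], [0.0, 1.0]]
history_emb = [[0.9, 0.1], [0.1, 0.9]]

aligned_loss = info_nce(instruction_emb, history_emb)
mismatched_loss = info_nce(instruction_emb, history_emb[::-1])
print(aligned_loss, mismatched_loss)
```

In this sketch the loss is low when each instruction embedding is closest to its own history embedding and high when the pairs are swapped, which is the signal a pre-fusion alignment objective would use to pull matched modalities together.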