The authors propose Dual-Level Alignment (DELAN), a framework that uses cross-modal contrastive learning to align the navigation-related modalities (instruction, observation, and navigation history) before the fusion stage, with the aim of improving cross-modal interaction and action decision-making.
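To make the idea of contrastive alignment before fusion concrete, here is a minimal sketch of a symmetric InfoNCE-style loss between two sets of embeddings, such as instruction and navigation-history vectors. The summary above does not state the exact loss DELAN uses, so this is a generic contrastive objective and an assumption, not the authors' formulation. The function names, the temperature value, and the toy data are all hypothetical.

```python
import math
import random


def normalize(vec):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def info_nce(anchors, candidates, temperature=0.1):
    """Mean cross-entropy of selecting candidates[i] as the match for anchors[i].

    Row i of each list is assumed to be a matched pair; every other
    candidate in the batch serves as a negative.
    """
    anchors = [normalize(v) for v in anchors]
    candidates = [normalize(v) for v in candidates]
    total = 0.0
    for i, anchor in enumerate(anchors):
        # Cosine similarity to every candidate, sharpened by the temperature.
        logits = [
            sum(x * y for x, y in zip(anchor, cand)) / temperature
            for cand in candidates
        ]
        # Numerically stable log-sum-exp.
        peak = max(logits)
        log_z = peak + math.log(sum(math.exp(l - peak) for l in logits))
        total += log_z - logits[i]
    return total / len(anchors)


def symmetric_info_nce(a, b, temperature=0.1):
    """Average the loss in both directions (a -> b and b -> a)."""
    return 0.5 * (info_nce(a, b, temperature) + info_nce(b, a, temperature))


# Toy data (hypothetical): 8 paired embeddings of dimension 16.
rng = random.Random(0)


def random_vector(dim):
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]


instruction = [random_vector(16) for _ in range(8)]
# "Aligned" history embeddings are the instruction vectors plus small noise.
history_aligned = [[x + 0.01 * rng.gauss(0.0, 1.0) for x in v] for v in instruction]
# "Unaligned" history embeddings are unrelated random vectors.
history_random = [random_vector(16) for _ in range(8)]

loss_aligned = symmetric_info_nce(instruction, history_aligned)
loss_random = symmetric_info_nce(instruction, history_random)
print(f"aligned pairs: {loss_aligned:.4f}  unrelated pairs: {loss_random:.4f}")
```

Minimizing a loss of this kind pulls matched pairs together and pushes mismatched pairs apart, so the toy aligned pairs should score a lower loss than the unrelated ones. In a real model the embeddings would come from trainable encoders, and the loss would typically be added to the navigation objective as an auxiliary term.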
Using language as a perceptual representation can improve performance in low-data vision-and-language navigation settings compared to using continuous visual features alone.