Key Concepts
Causal learning can enhance the robustness and generalization of vision-and-language navigation (VLN) agents by mitigating the negative effects of observable and unobservable confounders in the data.
Summary
This paper introduces the generalized cross-modal causal transformer (GOAT), a VLN model that uses causal inference to address dataset bias. The key points are:
The authors construct a unified structural causal model for VLN, considering both observable confounders (e.g., keywords in instructions, room references in environments) and unobservable confounders (e.g., decoration styles, sentence patterns, trajectory trends).
To mitigate the impact of these confounders, the authors propose two causal learning modules: back-door adjustment causal learning (BACL) and front-door adjustment causal learning (FACL). BACL handles observable confounders by blocking the back-door path, while FACL addresses unobservable confounders by constructing a front-door path.
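To make the back-door idea concrete, the following is a minimal sketch of back-door adjustment on a toy problem with a single observable binary confounder Z. It is not the BACL module itself: the variable names and every probability are made-up assumptions chosen only to show how averaging over P(z), rather than P(z | x), removes the confounder's influence on the estimated effect of X on Y.

```python
# Toy back-door adjustment with one observable binary confounder Z.
# All numbers are illustrative assumptions, not values from the paper.

# P(Z = z): prior over the confounder
p_z = {0: 0.5, 1: 0.5}

# P(X = 1 | Z = z): the confounder influences the input X
p_x1_given_z = {0: 0.2, 1: 0.8}

# P(Y = 1 | X = x, Z = z): the confounder also influences the outcome Y
p_y1_given_xz = {
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}


def p_x_given_z(x, z):
    """P(X = x | Z = z) for binary X."""
    p1 = p_x1_given_z[z]
    return p1 if x == 1 else 1.0 - p1


def observational(x):
    """P(Y=1 | X=x) = sum_z P(Y=1 | x, z) * P(z | x). Biased by Z."""
    joint = {z: p_x_given_z(x, z) * p_z[z] for z in p_z}
    total = sum(joint.values())
    return sum(p_y1_given_xz[(x, z)] * joint[z] / total for z in p_z)


def backdoor(x):
    """P(Y=1 | do(X=x)) = sum_z P(Y=1 | x, z) * P(z). Z is averaged out."""
    return sum(p_y1_given_xz[(x, z)] * p_z[z] for z in p_z)


if __name__ == "__main__":
    print("observed difference:", observational(1) - observational(0))
    print("adjusted difference:", backdoor(1) - backdoor(0))
```

In this toy setting the observed difference between X=1 and X=0 is larger than the adjusted one, because part of the observed association comes from Z rather than from X. Front-door adjustment, which targets unobservable confounders through a mediator, is not covered by this sketch.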
Additionally, the authors introduce a cross-modal feature pooling (CFP) module to effectively aggregate long sequential features and build global confounder dictionaries. Contrastive learning is used to optimize CFP during pre-training.
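As a rough illustration of pooling followed by contrastive training, the sketch below collapses each feature sequence into one vector and scores matched text/vision pairs with an InfoNCE-style loss. Mean pooling is a stand-in assumption here; the summary does not specify how CFP aggregates features, and the function names, temperature, and example vectors are all hypothetical.

```python
import math


def mean_pool(sequence):
    """Collapse a list of equal-length feature vectors into one vector.

    Stand-in for the paper's pooling module, whose details are not given here.
    """
    n = len(sequence)
    dim = len(sequence[0])
    return [sum(vec[i] for vec in sequence) / n for i in range(dim)]


def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def info_nce(anchors, positives, temperature=0.1):
    """Mean InfoNCE loss: anchors[i] should match positives[i];
    every other entry of positives acts as a negative."""
    losses = []
    for i, anchor in enumerate(anchors):
        logits = [cosine(anchor, p) / temperature for p in positives]
        # Log-sum-exp with max subtraction for numerical stability.
        m = max(logits)
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        losses.append(log_denom - logits[i])
    return sum(losses) / len(losses)


# Hypothetical toy features: two instructions and two trajectories.
text_seqs = [[[1.0, 0.0], [0.8, 0.2]], [[0.0, 1.0], [0.1, 0.9]]]
vision_seqs = [[[0.9, 0.1]], [[0.2, 0.8]]]

text_vecs = [mean_pool(s) for s in text_seqs]
vision_vecs = [mean_pool(s) for s in vision_seqs]

aligned_loss = info_nce(text_vecs, vision_vecs)
mismatched_loss = info_nce(text_vecs, list(reversed(vision_vecs)))
```

The loss is low when each pooled instruction is most similar to its own pooled trajectory and rises when the pairs are swapped, which is the signal a contrastive objective uses to shape the pooled representations.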
Experiments on four VLN datasets (R2R, REVERIE, RxR, and SOON) show that GOAT generalizes better than previous state-of-the-art approaches. The authors suggest that the causal learning pipeline can also inform work on robustness in similar cross-modal tasks.
Statistics
The VLN task involves an embodied agent following natural language instructions to navigate real indoor environments.
The Matterport3D simulator is used to provide the environment as graphs with connected navigable nodes.
The agent receives natural language instructions and the current panorama separated into 36 sub-images.
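One way to picture the 36 sub-images is as a grid of viewing directions. The sketch below assumes the discretization commonly used in Matterport3D-based VLN setups, 12 headings spaced 30 degrees apart at 3 elevations; the summary itself only states the total of 36, so the 12 x 3 split, the elevation values, and the function name are assumptions.

```python
import math


def panorama_view_angles(n_headings=12, elevations_deg=(-30, 0, 30)):
    """Return (heading, elevation) pairs in radians, one per sub-image.

    Assumed layout: n_headings evenly spaced headings per elevation level.
    """
    views = []
    for elevation in elevations_deg:
        for k in range(n_headings):
            heading = k * 360.0 / n_headings
            views.append((math.radians(heading), math.radians(elevation)))
    return views


views = panorama_view_angles()
print(len(views))  # number of sub-images in the assumed grid
```

Each pair can then be attached to the visual feature of the corresponding sub-image so the agent knows which direction that view faces.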
Quotes
"One way to mitigate dataset bias in VLN is to build broader and more diverse datasets, which is what numerous recent studies have focused on. However, achieving a perfectly balanced dataset devoid of bias is nearly impossible."
"The reason why humans can well execute various instructions and navigate in unknown environments is that we can learn the inherent causality of events beyond biased observation, achieving good analogical association capability."