The authors present NaviLLM, the first generalist model for embodied navigation that adapts Large Language Models (LLMs) to handle diverse tasks. The key idea is to cast various embodied tasks into generation problems using schema-based instruction, which flexibly unifies tasks like vision-language navigation, object localization, 3D question answering, and trajectory summarization.
The authors first introduce a scene encoding module that extracts multi-view representations from visual observations. They then design schema-based instruction, which consists of four components—task, observation, history, and output hint—to transform different embodied tasks into generation problems that can be optimized with a single unified cross-entropy objective.
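To make the schema-based instruction concrete, here is a minimal sketch of how the four components might be assembled into one generation prompt trained with a standard autoregressive cross-entropy loss. The field names, prompt wording, and placeholder tokens below are illustrative assumptions, not the paper's exact schema.

```python
# Minimal sketch of schema-based instruction (field names, wording, and
# placeholder tokens are illustrative assumptions, not the paper's templates).

def build_schema_prompt(task: str, observation: str, history: str, output_hint: str) -> str:
    """Assemble the four schema components into a single generation prompt."""
    return (
        f"Task: {task}\n"
        f"Observation: {observation}\n"
        f"History: {history}\n"
        f"Output hint: {output_hint}\n"
        "Answer:"
    )

# Example: casting one vision-language navigation step as text generation.
prompt = build_schema_prompt(
    task="Follow the instruction: 'Walk past the sofa and stop at the door.'",
    observation="<view_0> <view_1> <view_2>",  # placeholders for encoded multi-view features
    history="<step_0> <step_1>",               # placeholders for previously visited views/actions
    output_hint="Select the next viewpoint from the candidate views.",
)
target = "<view_2>"  # ground-truth next action, predicted token by token

# Every task (navigation, object localization, 3D QA, summarization) fills the
# same schema, so they all share one cross-entropy objective over `target`
# given `prompt`, which is what allows a single LLM to act as a generalist.
print(prompt, target)
```

Because question answering or summarization tasks simply swap in a textual answer as the target, no task-specific heads are needed; the schema is the only thing that changes across tasks.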
The authors train NaviLLM on a combined dataset covering various embodied tasks and evaluate its performance on multiple benchmarks. The results show that NaviLLM achieves new state-of-the-art results on CVDN, SOON, and ScanQA, surpassing previous methods by a significant margin. Moreover, NaviLLM demonstrates strong generalization, outperforming task-specific models on unseen tasks such as embodied question answering. The authors attribute this success to the model's ability to leverage knowledge from pre-trained LLMs and to unify diverse data sources through schema-based instruction.
Source: Duo Zheng et al., https://arxiv.org/pdf/2312.02010.pdf