An Embodied Multi-Modal Agent (EMMA) is trained by imitating an LLM expert in a parallel TextWorld to efficiently complete tasks in a visual environment.
NavCoT introduces a novel strategy for Vision-and-Language Navigation (VLN) by enabling self-guided navigational decision-making through disentangled reasoning, leading to significant performance improvements.
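The proposed DOZE dataset targets open-vocabulary zero-shot object navigation in dynamic environments, and includes static and dynamic obstacles, open-vocabulary objects, objects with distinct spatial and appearance attributes, and hint objects.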
This work addresses the feasibility of navigating to non-stationary targets whose placement follows routines.
The author proposes DOZE, a dataset that addresses the limitations of existing Zero-Shot Object Navigation (ZSON) datasets by incorporating dynamic obstacles, open-vocabulary objects, distinct-attribute objects, and hint objects. The dataset aims to challenge ZSON methods and to improve navigation efficiency, safety, and object recognition accuracy.