Core Concepts
This paper introduces Spartun3D, a large-scale situated 3D dataset, and Spartun3D-LLM, a novel 3D-based LLM architecture, designed to enhance the situated spatial understanding of LLMs in 3D environments.
Stats
The Spartun3D dataset consists of approximately 133k examples, comprising 10k situated captions and 123k QA pairs.
For object attribute, object relation, and affordance tasks, around 10 situations per scene were sampled.
For captioning and planning tasks, around 5 situations per scene were sampled.
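The per-task sampling above is essentially a budgeted random draw per scene. A minimal sketch of what that could look like, assuming a flat list of candidate situations per scene; the budget table, function, and variable names are illustrative, not the paper's actual pipeline:

```python
import random

# Assumed per-task budgets, taken from the statistics above.
SITUATIONS_PER_SCENE = {
    "object_attribute": 10,
    "object_relation": 10,
    "affordance": 10,
    "captioning": 5,
    "planning": 5,
}

def sample_situations(candidates, task, seed=0):
    """Draw the task-specific number of situations (e.g., agent pose plus
    viewpoint) from a scene's candidate pool. Hypothetical helper."""
    k = min(SITUATIONS_PER_SCENE[task], len(candidates))
    return random.Random(seed).sample(candidates, k)
```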
Human evaluation of Spartun3D found a high rate of valid outputs (86%-90%) when the Spa-prompt was used to provide spatial information.
In zero-shot SQA3D experiments, LEO trained on Spartun3D showed significant improvement over LEO trained on its original dataset, highlighting the effectiveness of Spartun3D for situated understanding.
Spartun3D-LLM consistently outperformed LEO+Spartun3D across all question types, with gains of around 2%-3% on every metric.
Analysis of responses to "which direction" questions in SQA3D revealed that Spartun3D-LLM produced a direction distribution closer to the ground truth than LEO did, indicating improved situated understanding.
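One simple way to quantify that closeness, assuming the answers are plain direction words; the vocabulary and helper functions below are assumptions, not the paper's evaluation code:

```python
from collections import Counter

DIRECTIONS = ["left", "right", "front", "back"]  # assumed answer vocabulary

def direction_distribution(answers):
    """Normalized frequency of each direction word among the given answers."""
    counts = Counter(a for a in answers if a in DIRECTIONS)
    total = sum(counts.values()) or 1  # guard against an empty answer set
    return {d: counts[d] / total for d in DIRECTIONS}

def total_variation(p, q):
    """Total variation distance: 0 for identical distributions, 1 for disjoint."""
    return 0.5 * sum(abs(p[d] - q[d]) for d in DIRECTIONS)

# A model whose predicted distribution has a smaller distance to the
# ground-truth distribution is "closer to the ground truth" in this sense.
gt = direction_distribution(["left", "right", "right", "front", "back"])
pred = direction_distribution(["right", "right", "right", "front", "left"])
print(total_variation(gt, pred))  # 0.2
```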
Quotes
"Despite the promising progress, current 3D-based LLMs still fall short in situated understanding, a fundamental capability for completing embodied tasks."
"Situated understanding refers to the ability to interpret and reason about a 3D scene from a dynamic egocentric perspective, where the agent must continuously adjust understanding based on its changing position and evolving environment around it."
"To address the aforementioned issues, we propose two key innovations: we first introduce a scalable, LLM-generated dataset named Spartun3D, consisting of approximately 133k examples."
"Furthermore, based on Spartun3D, we propose a new 3D-based LLM, Spartun3D-LLM, which is built on the most recent state-of-the-art 3D-based LLM, LEO, but integrated with a novel situated spatial alignment module that explicitly aligns 3D visual objects, their attributes and spatial relationship to surrounding objects with corresponding textual descriptions, with the goal of better bridging the gap between the 3D and text spaces."