Core Concepts
Integrating room-level spatial knowledge from pre-built maps and object-level commonsense knowledge from large language models into a commonsense scene graph (CSG) enables superior target localization and efficient object search for household robots.
Abstract
The paper introduces a commonsense scene graph-based target localization (CSG-TL) method that combines room-level spatial knowledge from pre-built maps with object-level commonsense knowledge from large language models (LLMs) to improve target localization accuracy.
The key steps are:
- Constructing a commonsense scene graph (CSG) that integrates the room-level layout of stationary items and object-level commonsense knowledge obtained through LLM prompts.
- Developing the CSG-TL model that leverages the CSG structure to predict the likelihood of correlation between the target object and other objects, enabling efficient target localization.
- Incorporating the CSG-TL into a commonsense scene graph-based object search (CSG-OS) framework, which uses the target localization results to guide the robot's search strategy.
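The steps above can be illustrated with a minimal sketch of the CSG data structure. All class names, fields, and the hard-coded correlation scores below are hypothetical: the paper derives its edge knowledge by prompting an LLM and learns correlations with a graph model, whereas this sketch simply stores object nodes with map positions and commonsense co-occurrence scores on edges.

```python
from dataclasses import dataclass, field

@dataclass
class CSGNode:
    """An object node: label plus a room-level position from the pre-built map."""
    label: str
    position: tuple  # (x, y) in map coordinates

@dataclass
class CSG:
    """Minimal commonsense scene graph sketch.

    Nodes are objects placed via the room-level map; edges carry an
    object-level commonsense co-occurrence score (hard-coded here, but
    obtained via LLM prompts in the paper).
    """
    nodes: dict = field(default_factory=dict)
    edges: dict = field(default_factory=dict)

    def add_object(self, label, position):
        self.nodes[label] = CSGNode(label, position)

    def add_commonsense_edge(self, a, b, score):
        # Undirected edge: frozenset key ignores argument order.
        self.edges[frozenset((a, b))] = score

    def correlation(self, a, b):
        """Look up the commonsense correlation between two objects."""
        return self.edges.get(frozenset((a, b)), 0.0)

csg = CSG()
csg.add_object("sofa", (1.0, 2.0))
csg.add_object("remote", (1.2, 2.1))
csg.add_object("sink", (5.0, 0.5))
csg.add_commonsense_edge("sofa", "remote", 0.9)   # commonsense: remotes sit near sofas
csg.add_commonsense_edge("sink", "remote", 0.05)  # weak association
```

In the actual method, CSG-TL runs a learned model over this graph to predict the link likelihood between a queried target and every other node, rather than reading back stored scores.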
The proposed approach is evaluated on the ScanNet dataset and the AI2THOR simulator, demonstrating superior performance compared to existing methods that rely solely on statistical correlations or partial scene knowledge. The CSG-OS framework is also successfully deployed on a real-world Jackal robot, showcasing its practical applicability.
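How localization results guide the search can be sketched as a simple greedy ranking. The function name and the example scores are illustrative assumptions; the paper's CSG-OS planner is more involved (it selects search positions in the map, not just an object ordering).

```python
def rank_search_waypoints(scores, candidates):
    """Order candidate anchor objects by their predicted correlation
    with the target, so the robot searches the most likely region first.

    scores: dict mapping anchor object -> predicted target correlation
            (in the paper, produced by the CSG-TL model).
    """
    return sorted(candidates, key=lambda obj: scores.get(obj, 0.0), reverse=True)

# Hypothetical CSG-TL output for target "remote":
scores = {"sofa": 0.9, "tv_stand": 0.7, "sink": 0.05}
order = rank_search_waypoints(scores, ["sink", "sofa", "tv_stand"])
# → ["sofa", "tv_stand", "sink"]: check near the sofa first
```

A greedy order like this minimizes expected search effort only when travel costs are roughly equal; a full planner would trade predicted likelihood off against path length.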
Stats
The paper reports the following key statistics:
- Link prediction accuracy on the ScanNet dataset: CSG-TL achieves 89.73%, outperforming statistical methods (27.56%) and graph-based methods without commonsense (80.03%).
- Link prediction accuracy on the AI2THOR single-room environment: CSG-TL achieves 81.09%, compared to statistical methods (47.32%) and graph-based methods without commonsense (73.11%).
- Link prediction accuracy on the AI2THOR multi-room environment: CSG-TL achieves 78.21%, compared to statistical methods (31.77%) and graph-based methods without commonsense (65.21%).
Quotes
"To efficiently locate the target object, the robot needs to be equipped with knowledge at both the object and room level. However, existing approaches rely solely on one type of knowledge, leading to unsatisfactory object localization performance and, consequently, inefficient object search processes."
"To leverage LLMs' strengths and address existing works' limitations, we propose a novel commonsense scene graph-based target localization method, CSG-TL. In contrast to previous works dependent on statistical correlations without object-level commonsense or struggles with capturing room-level object correlations due to limited viewpoints, our model captures both the room-level spatial layouts from pre-built maps and object-level commonsense knowledge obtained from LLMs."