Core Concepts
A framework that leverages the reasoning capabilities of Large Language Models (LLMs) and Large Vision Language Models (LVLMs) to efficiently explore and navigate an unfamiliar 3D environment in search of a target object. It does so by constructing a semantically rich, goal-oriented 3D scene representation.
Abstract
The paper presents a framework that aims to solve the object goal navigation problem in unfamiliar 3D environments by leveraging the reasoning capabilities of Large Language Models (LLMs) and Large Vision Language Models (LVLMs). The key aspects of the framework are:
Open Vocabulary Image Segmentation: The agent combines several models (RAM, Grounding DINO, FastSAM) to perform open-vocabulary semantic segmentation of the RGB images, identifying and segmenting objects in the scene.
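A minimal sketch of how these three models could be chained. The functions `tag_image`, `detect_boxes`, and `segment_masks` are hypothetical placeholder wrappers for RAM, Grounding DINO, and FastSAM respectively; the real model APIs differ.

```python
# Hypothetical open-vocabulary segmentation pipeline: tag -> ground -> segment.
# The three callables stand in for RAM, Grounding DINO, and FastSAM (assumed
# wrappers, not the libraries' actual interfaces).

def open_vocab_segmentation(rgb_image, tag_image, detect_boxes, segment_masks):
    """Return per-object (label, box, mask) records for one RGB frame."""
    tags = tag_image(rgb_image)            # RAM: open-vocabulary class tags
    boxes = detect_boxes(rgb_image, tags)  # Grounding DINO: one box per grounded tag
    detections = []
    for label, box in boxes:
        mask = segment_masks(rgb_image, box)  # FastSAM: mask inside each box
        detections.append({"label": label, "box": box, "mask": mask})
    return detections
```

The design point is that tagging proposes *what* to look for, grounding localizes *where*, and segmentation refines boxes into pixel-accurate masks.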
LLM as a Pruner: The agent uses an LLM's in-context learning abilities to prune the detected object classes, retaining only the most relevant ones for understanding the semantic priors of the environment.
3D Scene Modular Representation: The pruned object detections are used to construct a 3D scene representation, where each object is represented as a node containing information about its position, point cloud, and semantic description. The representation is sparse except in the vicinity of the target object.
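One way such a node could look in code. This is an illustrative sketch; the field names and types are assumptions, not the paper's actual schema.

```python
from dataclasses import dataclass, field

# Hypothetical node in the modular 3D scene representation.
# Fields mirror the description above: position, point cloud, semantic caption.

@dataclass
class SceneNode:
    label: str                       # pruned semantic class, e.g. "sofa"
    position: tuple                  # estimated (x, y, z) centroid in the world frame
    point_cloud: list = field(default_factory=list)  # back-projected 3D points
    caption: str = ""                # LVLM-generated semantic description

# The scene is a sparse collection of such nodes, kept dense only near
# candidate locations of the target object.
scene = {}
scene[0] = SceneNode(label="sofa", position=(1.2, 0.0, 3.4))
```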
LLM as a Planner: The agent uses an LLM, grounded in the current task, to reason about the 3D scene representation and decide whether to continue exploring the environment or move closer to a detected object that has a high probability of containing the target object.
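The explore-vs-approach decision could be sketched as follows. `query_llm` is a placeholder for any chat-completion call, and the prompt and reply format are illustrative assumptions, not the paper's actual prompt.

```python
# Hypothetical LLM-as-planner step: summarize the scene nodes in a prompt and
# parse the model's choice between exploring and approaching a candidate object.

def plan_next_action(target, scene_nodes, query_llm):
    """Return ("explore", None) or ("goto", node_id) based on the LLM's reply."""
    node_list = "\n".join(
        f"- id={i}: {n['label']} at {n['position']}"
        for i, n in enumerate(scene_nodes)
    )
    prompt = (
        f"Task: find a {target}.\n"
        f"Detected objects:\n{node_list}\n"
        "Reply 'explore' to keep exploring, or 'goto <id>' for the object "
        "most likely to be near the target."
    )
    reply = query_llm(prompt).strip().lower()
    if reply.startswith("goto"):
        return ("goto", int(reply.split()[1]))
    return ("explore", None)
```

Grounding the LLM in the current task this way lets semantic priors (e.g. cups are likely near tables) drive the high-level decision.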
Short-term Memory and Reasoning: The agent maintains a short-term memory to store processed information about the scene. When the agent reaches a selected object, it retrieves the stored frames to construct hypotheses about the target object using an LVLM.
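A minimal sketch of such a short-term memory, assuming a bounded buffer of processed frames indexed by the object classes seen in each; the class name and interface are hypothetical.

```python
from collections import deque

# Hypothetical short-term memory: a bounded buffer of recent processed frames.
# On reaching a selected object, frames that observed it are retrieved and
# passed to an LVLM to construct hypotheses about the target.

class ShortTermMemory:
    def __init__(self, capacity=50):
        self.frames = deque(maxlen=capacity)  # oldest frames are evicted first

    def store(self, frame_id, labels):
        self.frames.append({"frame_id": frame_id, "labels": set(labels)})

    def retrieve(self, label):
        """Return stored frames in which the given object class was seen."""
        return [f for f in self.frames if label in f["labels"]]
```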
Execution-level Planner: The agent uses a goal-oriented semantic mapping module to plan low-level actions to reach the goal locations identified by the high-level planner.
The framework is evaluated on the HomeRobot: Open Vocabulary Mobile Manipulation simulation benchmark. The results show that the LLM-based agent exhibits a human-like thought process when exploring the environment, and that the short-term memory, pruning, and object-captioning modules play a crucial role in its performance.
Stats
The agent is tasked to find a target object in an unfamiliar 3D environment within 500 steps.
The success rate (SR) and Success weighted by Path Length (SPL) are used as evaluation metrics.
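For reference, SPL is the standard embodied-navigation metric (Anderson et al., 2018), defined as

```latex
\mathrm{SPL} = \frac{1}{N} \sum_{i=1}^{N} S_i \, \frac{\ell_i}{\max(p_i,\, \ell_i)}
```

where N is the number of episodes, S_i ∈ {0, 1} indicates success on episode i, ℓ_i is the shortest-path length from start to goal, and p_i is the path length the agent actually traveled. SPL thus discounts successes achieved via inefficient paths.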