Core Concepts
Clio constructs a hierarchical 3D scene graph of the environment in real time, retaining only the objects and regions that are relevant to the task.
Abstract
The paper proposes Clio, a novel approach for building task-driven 3D scene graphs in real time with embedded open-set semantics.
The key insights are:
- Progress in vision-language models allows specifying tasks as a list of natural language instructions.
- Modern foundation models for task-agnostic segmentation provide a way to over-segment an image into a potentially large number of segments, which can be reprojected to 3D.
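To make the two insights concrete, the sketch below scores 3D primitives against task instructions, assuming both have already been embedded into a shared vision-language space (for example by a CLIP-style model). The function name `task_relevance`, the `temperature` parameter, and the softmax normalization are illustrative assumptions, not the paper's exact construction.

```python
import numpy as np

def task_relevance(primitive_emb, task_emb, temperature=0.1):
    """Hypothetical helper: relate primitives to tasks via cosine similarity.

    primitive_emb: (N, D) array, one embedding per 3D primitive (assumed given).
    task_emb:      (M, D) array, one embedding per natural-language task (assumed given).
    Returns the (N, M) cosine-similarity matrix and a row-normalized
    distribution over tasks for each primitive.
    """
    primitive_emb = np.asarray(primitive_emb, dtype=float)
    task_emb = np.asarray(task_emb, dtype=float)
    # Normalize rows so that the dot product equals cosine similarity.
    p = primitive_emb / np.linalg.norm(primitive_emb, axis=1, keepdims=True)
    t = task_emb / np.linalg.norm(task_emb, axis=1, keepdims=True)
    sim = p @ t.T
    # Softmax over tasks (an assumption made here for illustration only).
    logits = sim / temperature
    logits = logits - logits.max(axis=1, keepdims=True)
    probs = np.exp(logits)
    probs = probs / probs.sum(axis=1, keepdims=True)
    return sim, probs
```

Each row of the returned distribution can be read as how strongly a primitive relates to each task; a primitive whose similarities are all low would be a candidate for being dropped from the map.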
The authors formulate task-aware 3D scene understanding as the problem of compressing task-agnostic primitives into clusters of task-relevant concepts that are maximally informative about the given tasks. This is naturally expressed using the Information Bottleneck (IB) principle.
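For reference, the IB principle in its standard form can be written as below. The symbols are assumptions chosen for this summary (X for the task-agnostic primitives, X̃ for the compressed clusters, Y for the tasks, β for the trade-off weight), and the paper's own notation may differ.

```latex
\min_{p(\tilde{x} \mid x)} \; I(X; \tilde{X}) \;-\; \beta \, I(\tilde{X}; Y)
```

The first term rewards compression (the clusters carry little information about the individual primitives), while the second rewards relevance (the clusters stay informative about the tasks). A larger β favors keeping task information over compressing the map.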
The authors develop an Agglomerative IB algorithm to cluster the 3D primitives into task-relevant objects and regions. They integrate this algorithm into a real-time system, Clio, that constructs a hierarchical 3D scene graph of the environment as the robot explores it.
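A minimal sketch of generic agglomerative IB clustering follows, assuming a prior `p_x` over primitives and a conditional `p_y_given_x` over tasks are already available. It greedily merges the pair of clusters whose merge loses the least task information and stops once the cheapest merge exceeds `max_loss`. The function names, the all-pairs search, and the stopping rule are illustrative assumptions; this is not Clio's implementation, and it omits any real-time or spatial-adjacency machinery.

```python
import numpy as np

def kl(p, q):
    """KL divergence D(p || q) in nats, skipping zero entries of p."""
    mask = p > 0
    return float(np.sum(p[mask] * np.log(p[mask] / q[mask])))

def merge_cost(w_i, w_j, py_i, py_j):
    """Task information lost by merging two clusters:
    (w_i + w_j) times the weighted Jensen-Shannon divergence of p(y | cluster)."""
    w = w_i + w_j
    pi_i, pi_j = w_i / w, w_j / w
    mix = pi_i * py_i + pi_j * py_j
    return w * (pi_i * kl(py_i, mix) + pi_j * kl(py_j, mix))

def agglomerative_ib(p_x, p_y_given_x, max_loss):
    """Greedy bottom-up clustering; returns lists of primitive indices."""
    clusters = [[i] for i in range(len(p_x))]
    weights = [float(w) for w in p_x]
    cond = [np.asarray(row, dtype=float) for row in p_y_given_x]
    while len(clusters) > 1:
        # Find the cheapest merge among all cluster pairs.
        best = None
        for i in range(len(clusters)):
            for j in range(i + 1, len(clusters)):
                cost = merge_cost(weights[i], weights[j], cond[i], cond[j])
                if best is None or cost < best[0]:
                    best = (cost, i, j)
        cost, i, j = best
        if cost > max_loss:
            break  # any further merge would discard too much task information
        # Merge cluster j into cluster i (j > i, so deleting j keeps i valid).
        w = weights[i] + weights[j]
        cond[i] = (weights[i] * cond[i] + weights[j] * cond[j]) / w
        weights[i] = w
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j], weights[j], cond[j]
    return clusters
```

With four equally weighted primitives, two leaning toward one task and two toward another, a small `max_loss` groups them into two clusters and refuses the final merge, since that merge would erase the distinction between the tasks.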
Clio is evaluated on several datasets, showing that it constructs more parsimonious and useful map representations than the baselines while running in real time. The authors also demonstrate Clio on a real robot (Boston Dynamics Spot) performing mobile manipulation tasks.
Stats
"Modern tools for class-agnostic image segmentation (e.g., SegmentAnything) and open-set semantic understanding (e.g., CLIP) provide unprecedented opportunities for robot perception and mapping."
"Recent works, including [3–7], build metric-semantic 3D maps by detecting objects and regions corresponding to a closed set of semantic labels."
"Clio not only allows real-time open-set 3D scene graph construction, but also improves the accuracy of task execution by limiting the map to relevant objects and regions."
Quotes
"While traditional closed-set metric-semantic maps were restricted to tens or hundreds of semantic classes, we can now build maps with a plethora of objects and countless semantic variations."
"Humans not only take into account the task when (consciously or unconsciously) deciding which objects to represent and how, but are also able to consequently ignore parts of a scene that are irrelevant to the task [12]."
"Clio takes a list of tasks specified in natural language at the beginning of operation: for instance, these can be the tasks the robot is envisioned to perform during its lifetime or during its current deployment."