Core Concepts
A curious robot autonomously discovers and learns unsupervised categories grounded in its physical interactions and visual observations, laying the foundation for symbol grounding.
Abstract
The paper presents a system in which a curious robot autonomously discovers and learns unsupervised categories in a bottom-up manner, without starting from pre-defined symbols. The key aspects are:
The robot is equipped with an approximate model of curiosity that drives it to explore its sensorimotor space and to build categories from its physical actions and visual observations.
The robot models its curiosity with the Explauto framework: a Sensorimotor Model learns the mapping between motor actions and sensory effects, and an Interest Model selects the actions expected to most improve the Sensorimotor Model's predictions.
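This exploration loop can be pictured with the following minimal sketch. It is written in the spirit of Explauto rather than against its actual API; the class names, the nearest-neighbor forward model, the novelty-based interest heuristic, and the toy environment are all illustrative assumptions.

```python
# A minimal curiosity loop in the spirit of Explauto; all names are
# illustrative, not Explauto's actual API.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

class SensorimotorModel:
    """Learns the forward mapping from motor commands m to sensory effects s."""
    def __init__(self):
        self.M, self.S = [], []
        self.knn = KNeighborsRegressor(n_neighbors=3)

    def predict(self, m):
        if len(self.M) < 3:
            return np.zeros(2)                    # no data yet
        return self.knn.predict([m])[0]

    def update(self, m, s):
        self.M.append(m)
        self.S.append(s)
        if len(self.M) >= 3:
            self.knn.fit(self.M, self.S)

class InterestModel:
    """Selects the motor command whose outcome is least well known (a crude
    curiosity proxy: here, distance to previously tried commands)."""
    def __init__(self, low, high, n_candidates=20):
        self.low, self.high, self.n_candidates = low, high, n_candidates
        self.errors = []                          # prediction errors, for monitoring

    def sample(self, sm_model):
        candidates = np.random.uniform(self.low, self.high,
                                       (self.n_candidates, len(self.low)))
        if not sm_model.M:
            return candidates[0]
        tried = np.array(sm_model.M)
        novelty = [np.linalg.norm(tried - c, axis=1).min() for c in candidates]
        return candidates[int(np.argmax(novelty))]

# Toy environment: motor = (rotation in degrees, travel in mm), sensory = 2-D.
def environment(m):
    return np.array([np.cos(np.radians(m[0])), m[1] / 100.0])

sm = SensorimotorModel()
im = InterestModel(low=np.array([-180.0, -100.0]), high=np.array([180.0, 100.0]))
for _ in range(200):
    m = im.sample(sm)                             # interest model picks an action
    s_pred = sm.predict(m)                        # predict before acting
    s = environment(m)                            # observe the actual effect
    sm.update(m, s)                               # improve the forward model
    im.errors.append(np.linalg.norm(s - s_pred))  # track prediction error
```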
The robot processes its visual input with object detection and segmentation models (YOLO, SAM) combined with visual representation models (CLIP, DINOv2), and uses the resulting feature vectors to represent the categories it discovers.
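One plausible shape for such a pipeline is sketched below, pairing YOLO detections with CLIP embeddings of each detected crop. The package and checkpoint choices (ultralytics YOLOv8, OpenAI CLIP ViT-B/32) are assumptions, not the paper's stated configuration; the 512-dimensional output matches the Stats section.

```python
# Sketch of a YOLO + CLIP perception pipeline. Package and checkpoint
# choices (ultralytics YOLOv8, OpenAI CLIP ViT-B/32) are assumptions.
import torch
from PIL import Image
from ultralytics import YOLO
from transformers import CLIPModel, CLIPProcessor

detector = YOLO("yolov8n.pt")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")   # 512-d features
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def object_vectors(image_path):
    """Detect objects, then embed each crop as a 512-d CLIP vector."""
    image = Image.open(image_path).convert("RGB")
    vectors = []
    for box in detector(image)[0].boxes.xyxy:     # each box is (x1, y1, x2, y2)
        crop = image.crop(tuple(box.int().tolist()))
        inputs = processor(images=crop, return_tensors="pt")
        with torch.no_grad():
            v = clip.get_image_features(**inputs)[0]   # shape (512,)
        vectors.append(v / v.norm())              # unit-normalize for cosine math
    return vectors
```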
The authors conduct a series of experiments, starting from a baseline setup and incrementally extending it with higher-dimensional visual features and a novel region-splitting mechanism based on cosine similarity. This allows the robot to discover categories that closely align with the objects in its environment.
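The paper's exact splitting rule is not reproduced here, but the idea can be illustrated: a region of sensory space stays a single category while its member feature vectors remain mutually similar under cosine similarity, and is split once they do not. The threshold value and the 2-means split in this sketch are assumptions.

```python
# Illustrative cosine-similarity splitting rule; the threshold and the
# 2-means split are assumptions, not the paper's exact mechanism.
import numpy as np
from sklearn.cluster import KMeans

def mean_pairwise_cosine(X):
    """Average cosine similarity over all distinct pairs of rows of X."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sims = Xn @ Xn.T
    n = len(X)
    return (sims.sum() - n) / (n * (n - 1))       # drop the n self-similarities

def maybe_split(region, threshold=0.85):
    """Keep a region whole while it is coherent; split it in two otherwise."""
    X = np.asarray(region)
    if len(X) < 4 or mean_pairwise_cosine(X) >= threshold:
        return [X]                                # coherent enough: one category
    labels = KMeans(n_clusters=2, n_init=10).fit_predict(X)
    return [X[labels == 0], X[labels == 1]]
```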
The final evaluation uses the Words-as-Classifiers (WAC) model to ground word-level classifiers in the discovered categories, demonstrating that the robot's autonomous category discovery can serve as a foundation for symbol grounding.
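In the WAC model, each word is paired with its own binary classifier over visual features, trained on objects the word does and does not describe; applying the classifier to a new object yields the probability that the word fits it. The sketch below uses logistic regression and synthetic data, with only the per-word-classifier structure taken from WAC.

```python
# Minimal WAC sketch: one binary classifier per word, trained on visual
# feature vectors of objects the word does / does not describe. Data here
# is synthetic; only the per-word-classifier structure is taken from WAC.
import numpy as np
from sklearn.linear_model import LogisticRegression

class WAC:
    def __init__(self):
        self.classifiers = {}                     # word -> fitted classifier

    def train_word(self, word, pos_feats, neg_feats):
        X = np.vstack([pos_feats, neg_feats])
        y = np.array([1] * len(pos_feats) + [0] * len(neg_feats))
        self.classifiers[word] = LogisticRegression(max_iter=1000).fit(X, y)

    def apply(self, word, feats):
        """P(word fits object) for each row of feats."""
        return self.classifiers[word].predict_proba(feats)[:, 1]

# Toy usage with synthetic 512-d "CLIP-like" embeddings.
rng = np.random.default_rng(0)
wac = WAC()
wac.train_word("red",
               rng.normal(0.5, 1.0, (30, 512)),   # objects described as "red"
               rng.normal(-0.5, 1.0, (30, 512)))  # objects not described as "red"
print(wac.apply("red", rng.normal(0.5, 1.0, (5, 512))))  # should be close to 1
```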
Overall, the paper presents a novel approach to the symbol grounding problem: the robot discovers categories in a bottom-up, unsupervised manner, laying the groundwork for grounding symbols in those categories.
Stats
The robot's motor actions consist of a rotation (in degrees, left or right) and a linear translation (in millimeters, forward or backward).
The robot's sensory input is a visual vector representing the observed object, either from YOLO+CLIP (512 dimensions) or SAM+DINOv2 (384 dimensions).
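As a concrete picture of these spaces, the sketch below encodes a motor command as a 2-D vector and a sensory observation as a single embedding; the field names and sign conventions are assumptions.

```python
# The motor and sensory spaces above, as data; field names and sign
# conventions are assumptions made for illustration.
from dataclasses import dataclass
import numpy as np

@dataclass
class MotorCommand:
    rotation_deg: float          # left/right rotation in degrees
    travel_mm: float             # forward/backward travel in millimeters

    def as_vector(self) -> np.ndarray:
        return np.array([self.rotation_deg, self.travel_mm])

# One sensory observation is a single visual embedding of the detected object:
clip_vector = np.zeros(512)      # YOLO+CLIP configuration
dinov2_vector = np.zeros(384)    # SAM+DINOv2 configuration
```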
Quotes
"Towards addressing the Symbol Grounding Problem and motivated by early childhood language development, we leverage a robot which has been equipped with an approximate model of curiosity with particular focus on bottom-up building of unsupervised categories grounded in the physical world."
"It is our hope that by enabling the robot to autonomously identify unsupervised categories throughout the space without proactively or immediately assigning labels, we are able to explore meaning with a bottom-up perspective."