Core Concepts
The ultimate goal of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments. The RH20T-P dataset is proposed to facilitate the development of composable generalization agents by providing a primitive-level robotic dataset with carefully designed primitive skills.
Abstract
The paper proposes the RH20T-P dataset, a primitive-level robotic dataset, to address the limitations of current composable generalization agents (CGAs) in handling novel physical skills.
The key highlights are:
The RH20T dataset serves as the data source, and a subset of its tasks is sampled to construct the RH20T-P dataset.
A set of composable and scalable primitive skills is designed, focusing on the state changes in the robot arm's motion and gripper during the manipulation process.
A hindsight annotation pipeline is used to segment each episode in RH20T into video clips and annotate them with the corresponding primitive skills.
To validate the effectiveness of RH20T-P, the paper introduces RA-P, a promising and scalable CGA built on RH20T-P. RA-P uses the open-source VLM LLaVA as the task planner and Deformable DETR as the motion planner.
Experiments show that RA-P, equipped with the well-designed primitive skills and spatial information in RH20T-P, can generalize to novel physical skills through composable generalization, outperforming agents using proprietary VLMs and imitation-based methods.
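The idea behind segmenting an episode by state changes can be illustrated with a minimal sketch. This is not the paper's annotation pipeline: it assumes each frame carries only a boolean gripper state, and the `Clip` type, the `segment_by_gripper` function, and the labels `move_free` and `move_with_object` are hypothetical names chosen for illustration. The actual primitive skills in RH20T-P also account for changes in the arm's motion.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Clip:
    start: int  # first frame index of the clip (inclusive)
    end: int    # last frame index of the clip (inclusive)
    label: str  # hypothetical primitive label, not taken from the paper


def _label(closed: bool) -> str:
    # Assumed two-way labeling; the real primitive set is richer.
    return "move_with_object" if closed else "move_free"


def segment_by_gripper(gripper_closed: Sequence[bool]) -> List[Clip]:
    """Split an episode into clips at every gripper open/closed transition."""
    clips: List[Clip] = []
    if not gripper_closed:
        return clips
    start = 0
    for i in range(1, len(gripper_closed)):
        if gripper_closed[i] != gripper_closed[i - 1]:
            # The state changed at frame i, so the current clip ends at i - 1.
            clips.append(Clip(start, i - 1, _label(gripper_closed[start])))
            start = i
    # Close the final clip, which runs to the last frame.
    clips.append(Clip(start, len(gripper_closed) - 1, _label(gripper_closed[start])))
    return clips


if __name__ == "__main__":
    episode = [False, False, True, True, True, False]
    for clip in segment_by_gripper(episode):
        print(clip)
```

Each resulting clip covers a contiguous run of frames with a constant gripper state, which is the kind of unit a primitive-skill label could then be attached to.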
Stats
The RH20T-P dataset contains about 33,000 video clips covering 44 diverse and complicated robotic tasks.
Quotes
"The ultimate goals of robotic learning is to acquire a comprehensive and generalizable robotic system capable of performing both seen skills within the training distribution and unseen skills in novel environments."
"Recent advancements in large language models (LLMs) have shown impressive potential in understanding instructions and interpreting contextual cues."
"Despite the huge promise of CGAs in handling novel skills, they face several challenges. Existing CGAs tend to use larger-scale proprietary models like GPT-4V as decision-making backends, resulting in a lack of transparency and flexibility."