Basic Concepts
A modular embodied AI system that receives open-ended natural language instructions and controls two arms to collaboratively accomplish long-horizon tasks, while incorporating semantic and physical safety mechanisms.
Summary
The paper presents a modular embodied AI system that enables a robot with two arms to perform complex manipulation tasks based on open-ended natural language instructions. The system consists of four key modules:
LLM Task Planner: A large language model (LLM) is used to map natural language requests to executable robot code, leveraging in-context learning capabilities.
Bi-arm Skills Library: A library of state-machine-based "zero-shot" manipulation skills, including pick, place, and handover, that the LLM planner can orchestrate.
VLM-PC Perception: A vision-language model (VLM) is used to parse RGB-D images and extract segmented object-centric point clouds, enabling semantic understanding of the scene.
Control: This module includes a point cloud transformer-based grasping policy, a constrained trajectory optimizer for real-time motion planning, and a compliant joint-space tracking controller.
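A minimal sketch of how these modules compose: the LLM planner maps an instruction to calls into the skills library, and each skill reports success or failure, which keeps failure points interpretable. All names here (`Skill`, `plan_with_llm`, `execute`) are illustrative stand-ins, not the paper's actual API, and the LLM and skills are stubbed.

```python
# Hypothetical sketch of the modular pipeline: an LLM planner emits calls
# into a bi-arm skills library. Names are illustrative, not the paper's API.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class Skill:
    """A state-machine-based manipulation skill exposed to the planner."""
    name: str
    run: Callable[..., bool]  # returns True on success

def make_skills_library() -> Dict[str, Skill]:
    # Stubbed skills; real ones would drive the arms via the control module.
    return {
        "pick": Skill("pick", lambda arm, obj: True),
        "place": Skill("place", lambda arm, obj, target: True),
        "handover": Skill("handover", lambda src, dst, obj: True),
    }

def plan_with_llm(instruction: str) -> List[Tuple[str, tuple]]:
    # Stand-in for the LLM: in the real system, in-context examples in the
    # prompt map open-ended language to executable skill calls.
    if "sort" in instruction:
        return [("pick", ("left", "bottle")),
                ("handover", ("left", "right", "bottle")),
                ("place", ("right", "bottle", "recycling_bin"))]
    return []

def execute(instruction: str) -> bool:
    skills = make_skills_library()
    for name, args in plan_with_llm(instruction):
        if not skills[name].run(*args):
            return False  # each skill is an interpretable failure point
    return True
```

The modularity is what enables debugging: a failed run pinpoints which skill (or the plan itself) broke, rather than attributing the failure to an end-to-end black box.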
The authors demonstrate the system's performance on several tasks, including bi-arm sorting, bottle opening, and trash disposal, all of which require coordination between the two arms. They show that the modular design enables zero-shot generalization, incorporation of semantic and physical safety constraints, and interpretability of failures.
The key innovations include the seamless integration of learning-based and non-learning-based components, the use of in-context learning to bridge the high-level LLM planner and low-level control, and the emphasis on safety and interpretability throughout the system.
Statistics
"We demonstrate performance for the following tasks: bi-arm sorting, bottle opening, and trash disposal tasks. These are done zero-shot where the models used have not been trained with any real world data from this bi-arm robot, scenes or workspace."
"The trajectory optimizer plans in the combined joint-space of both arms, incorporating various kinematic and semantic constraints, and is also used as a feasibility reasoner to pass textual feedback such as "cannot reach the object" back to the human."
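The feasibility-reasoner idea in the quote above can be sketched as a planner that returns textual feedback instead of failing silently. This is a toy illustration under assumed names and a made-up reach threshold (`ARM_REACH_M`), not the paper's optimizer.

```python
# Hypothetical sketch: a trajectory planner that doubles as a feasibility
# reasoner, returning textual feedback (e.g. "cannot reach the object")
# that can be surfaced to the human or the LLM planner.
import math

ARM_REACH_M = 0.9  # assumed workspace radius for one arm (illustrative)

def plan_to(target_xyz):
    """Return (trajectory, feedback); trajectory is None when infeasible."""
    dist = math.sqrt(sum(c * c for c in target_xyz))
    if dist > ARM_REACH_M:
        return None, "cannot reach the object"
    # Placeholder for the constrained joint-space trajectory optimization.
    return [tuple(target_xyz)], "ok"
```

Example: a target 1.5 m away yields `(None, "cannot reach the object")`, giving the upper layers actionable text rather than an opaque failure.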
Quotes
"Composing both learning- and non-learning-based components in a modular fashion with interpretable inputs and outputs allows the user to easily debug points of failures and fragilities."
"The interaction between the LLM and the lower layers of control may be viewed as an instance of System1-System2 architecture popularized by [18]."