Key Idea
A pipeline that enhances a general-purpose Vision Language Model, GPT-4V, to facilitate one-shot visual teaching for robotic manipulation. The system analyzes videos of humans performing tasks and outputs executable robot programs that incorporate insights into affordances.
Abstract
The proposed system consists of two main components: a symbolic task planner and an affordance analyzer.
The symbolic task planner takes human video demonstrations, text instructions, or both as input, and outputs a sequence of robot actions. It has three sub-components:
- Video analyzer: Uses GPT-4V to recognize the actions performed by humans in the video and transcribe them into text instructions.
- Scene analyzer: Encodes the text instructions and the first frame of the video into a textual description of the working environment, including a list of object names, their graspable properties, and the spatial relationships between objects.
- Task planner: Outputs a sequence of robot tasks based on the given text instructions and environmental information, using GPT-4 (a minimal sketch of the full three-stage flow follows this list).
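To make the three-stage flow concrete, here is a minimal sketch, assuming the OpenAI Chat Completions API. The model name, prompts, and helper names are illustrative assumptions, not the authors' released implementation.

```python
# A minimal sketch of the three-stage planner, assuming the OpenAI
# Chat Completions API. Model name, prompts, and helper names are
# illustrative, not the authors' released implementation.
import base64
from collections.abc import Sequence

from openai import OpenAI

client = OpenAI()
MODEL = "gpt-4o"  # stand-in for the GPT-4V / GPT-4 pair used in the paper


def ask(prompt: str, images: Sequence[bytes] = ()) -> str:
    """One model round-trip; `images` are optional JPEG-encoded frames."""
    content = [{"type": "text", "text": prompt}]
    for img in images:
        b64 = base64.b64encode(img).decode()
        content.append({
            "type": "image_url",
            "image_url": {"url": f"data:image/jpeg;base64,{b64}"},
        })
    resp = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": content}],
    )
    return resp.choices[0].message.content


def plan_from_video(frames: Sequence[bytes]) -> str:
    # 1) Video analyzer: transcribe the demonstration into text instructions.
    instructions = ask(
        "Describe, step by step, the manipulation the person performs.",
        images=frames,
    )
    # 2) Scene analyzer: encode the first frame into environment information.
    scene = ask(
        "List the visible objects, whether each is graspable, and the "
        "spatial relationships between them.",
        images=frames[:1],
    )
    # 3) Task planner: map instructions + scene to a robot task sequence.
    return ask(
        f"Instructions:\n{instructions}\n\nScene:\n{scene}\n\n"
        "Output an ordered sequence of robot actions, one per line."
    )
```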
The affordance analyzer re-analyzes the given videos using the knowledge from the symbolic task planner to acquire the affordance information necessary for effective robot execution. It focuses on the relationship between hands and objects to identify the moments and locations of grasping and releasing, and then extracts affordance information such as approach directions, grasp types, waypoints, and body postures.
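The grasp/release timing step could be approximated as below, assuming per-frame 2D hand and object centroids from off-the-shelf detectors; the proximity threshold is an illustrative stand-in for the paper's hand-object relationship analysis.

```python
# A sketch of grasp/release timing, assuming per-frame 2D hand and
# object centroids from off-the-shelf detectors. The proximity
# threshold is an illustrative assumption, not the paper's method.
import numpy as np


def grasp_release_frames(hand_xy: np.ndarray,
                         obj_xy: np.ndarray,
                         touch_px: float = 30.0) -> tuple[int, int]:
    """Return (grasp_frame, release_frame) from hand-object proximity.

    hand_xy, obj_xy: arrays of shape (T, 2), one centroid per frame.
    Contact is assumed whenever the centroids lie within `touch_px`
    pixels; the first and last contact frames approximate the grasp
    and release moments.
    """
    dist = np.linalg.norm(hand_xy - obj_xy, axis=1)
    contact = np.flatnonzero(dist < touch_px)
    if contact.size == 0:
        raise ValueError("no hand-object contact detected")
    return int(contact[0]), int(contact[-1])
```

Frames around these two timestamps can then be cropped and re-queried to estimate the remaining affordance information, such as approach direction and grasp type.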
Qualitative experiments using real robots have demonstrated the effectiveness of this pipeline in various scenarios. Quantitative evaluation using a public dataset of cooking videos revealed instances of hallucination in GPT-4V, highlighting the importance of incorporating human supervision within the pipeline.
Statistics
The system is evaluated with the following key metrics:
- Normalized Levenshtein distance between the output task sequence and the ground-truth sequence, scaled to a 0-1 score where 1 indicates an exact match (a computation sketch follows this list).
- Percentage of videos in which GPT-4V correctly transcribed both the name of the manipulated object and the action performed.
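The first metric can be computed with a standard dynamic-programming edit distance, normalized by the longer sequence length so that 1.0 means an exact match; the exact normalization used in the paper may differ slightly.

```python
# Sequence-match score: edit distance between the predicted and
# ground-truth task sequences, normalized by the longer length so
# that 1.0 means an exact match.
def levenshtein(a: list[str], b: list[str]) -> int:
    """Edit distance (insertions, deletions, substitutions)."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        cur = [i]
        for j, y in enumerate(b, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (x != y)))  # substitution
        prev = cur
    return prev[-1]


def sequence_score(pred: list[str], truth: list[str]) -> float:
    return 1.0 - levenshtein(pred, truth) / max(len(pred), len(truth), 1)


# One substitution in a three-step plan scores 2/3.
print(sequence_score(["grasp", "move", "release"],
                     ["grasp", "move", "place"]))  # ~0.667
```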
Quotes
"This research makes three contributions: (1) Proposing a ready-to-use multimodal task planner that utilizes off-the-shelf VLM and LLM (2) Proposing a methodology for aligning GPT-4V's recognition with affordance information for grounded robotic manipulation (3) Making the code publicly accessible as a practical resource for the robotics research community."