The paper presents SCREWMIMIC, a novel framework for learning bimanual manipulation behaviors from a single human video demonstration. The key insight is that the relative motion between the two hands during bimanual manipulation can be effectively modeled as a screw motion: a constrained motion that either reflects a physical constraint in the environment or otherwise facilitates the manipulation.
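As a concrete illustration, the sketch below constructs the rigid transform of a screw motion from one common parameterization (a unit axis direction, a point on the axis, a rotation angle, and a translation along the axis). The function name and parameter layout are assumptions for illustration, not necessarily the paper's exact formulation.

```python
import numpy as np

def skew(w):
    """3x3 skew-symmetric matrix such that skew(w) @ x == np.cross(w, x)."""
    return np.array([[0, -w[2], w[1]],
                     [w[2], 0, -w[0]],
                     [-w[1], w[0], 0]])

def screw_transform(s_hat, q, theta, d):
    """Rigid transform that rotates by `theta` about the line through point `q`
    with unit direction `s_hat`, and translates by `d` along that same axis."""
    s_hat = s_hat / np.linalg.norm(s_hat)
    K = skew(s_hat)
    # Rodrigues' formula for the rotation about the axis direction.
    R = np.eye(3) + np.sin(theta) * K + (1 - np.cos(theta)) * (K @ K)
    # Rotating about a line through q gives x' = R x + (I - R) q; then slide d along the axis.
    t = (np.eye(3) - R) @ q + d * s_hat
    T = np.eye(4)
    T[:3, :3] = R
    T[:3, 3] = t
    return T

# Example: a 90-degree twist about a vertical axis offset from the origin, with no
# translation along the axis -- roughly the relative motion of twisting a bottle cap
# while the other hand holds the bottle.
T = screw_transform(np.array([0., 0., 1.]), np.array([0.1, 0., 0.]), np.pi / 2, 0.0)
```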
SCREWMIMIC consists of three main modules:
Extracting a screw action from a human demonstration: SCREWMIMIC uses off-the-shelf hand tracking to extract the trajectories of the human hands, and then interprets these trajectories as a screw motion between the hands. This provides a compact representation of the demonstrated bimanual behavior.
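A minimal sketch of this extraction step, assuming the hand tracker provides a sequence of 4x4 poses of one hand expressed in the other hand's frame; the function name, the use of only the first and last frames, and the least-squares recovery of a point on the axis are illustrative choices, not the paper's exact procedure.

```python
import numpy as np
from scipy.spatial.transform import Rotation as R

def fit_screw_axis(T_rel):
    """Fit a single screw (unit axis direction, point on the axis, total rotation
    angle, translation along the axis) to a sequence of relative hand poses
    T_rel[t] (4x4), e.g. the acting hand tracked in the stabilizing hand's frame."""
    # Net relative displacement from the first to the last frame.
    dT = T_rel[-1] @ np.linalg.inv(T_rel[0])
    rotvec = R.from_matrix(dT[:3, :3]).as_rotvec()
    theta = np.linalg.norm(rotvec)
    s_hat = rotvec / theta if theta > 1e-6 else np.array([0., 0., 1.])
    t = dT[:3, 3]
    d = float(s_hat @ t)  # translation component along the axis
    # Solve (I - R) q = t - d * s_hat for a point q on the axis; least squares,
    # since (I - R) is rank-deficient along the axis direction.
    A = np.eye(3) - dT[:3, :3]
    q, *_ = np.linalg.lstsq(A, t - d * s_hat, rcond=None)
    return s_hat, q, theta, d
```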
Predicting a screw action from a point cloud: SCREWMIMIC trains a PointNet-based model to predict the screw action parameters from the 3D point cloud of a novel object. This allows the robot to generalize the learned behavior to new object instances and configurations.
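The sketch below shows a minimal PointNet-style regressor of the kind this module could use: a shared per-point MLP, an order-invariant max-pool over points, and a head that outputs screw parameters. The class name, output layout, and layer sizes are assumptions, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ScrewPredictor(nn.Module):
    """PointNet-style regressor from an object point cloud to screw parameters
    (here: unit axis direction, a point on the axis, and a scalar magnitude --
    a simplified output layout chosen for illustration)."""
    def __init__(self, out_dim=7):
        super().__init__()
        self.point_mlp = nn.Sequential(
            nn.Linear(3, 64), nn.ReLU(),
            nn.Linear(64, 128), nn.ReLU(),
            nn.Linear(128, 256), nn.ReLU(),
        )
        self.head = nn.Sequential(
            nn.Linear(256, 128), nn.ReLU(),
            nn.Linear(128, out_dim),
        )

    def forward(self, points):                 # points: (B, N, 3)
        feats = self.point_mlp(points)         # (B, N, 256) per-point features
        global_feat = feats.max(dim=1).values  # permutation-invariant pooling
        out = self.head(global_feat)
        axis = nn.functional.normalize(out[:, :3], dim=-1)  # unit axis direction
        point_on_axis = out[:, 3:6]
        magnitude = out[:, 6:7]
        return axis, point_on_axis, magnitude

# model = ScrewPredictor(); axis, q, mag = model(torch.randn(2, 1024, 3))
```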
Self-supervised screw-action policy fine-tuning: SCREWMIMIC uses the predicted screw action as an initialization and then refines it through an iterative self-supervised exploration process. This allows the robot to overcome differences in embodiment between the human and the robot.
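A hedged sketch of what such an iterative refinement loop could look like: perturb the predicted screw parameters, execute each candidate, and keep the best-scoring one. The perturb-and-select strategy, the hypothetical `execute_and_score` callback, and all hyperparameters are stand-ins for the paper's actual exploration procedure and success signal.

```python
import numpy as np

def refine_screw_action(initial_params, execute_and_score, iters=5, samples=8, sigma=0.05):
    """Local self-supervised refinement: sample Gaussian perturbations of the screw
    parameters, run each candidate via `execute_and_score` (a hypothetical callback
    that executes the bimanual action and returns a scalar success score), and keep
    the best candidate seen so far."""
    best_params = np.asarray(initial_params, dtype=float)
    best_score = execute_and_score(best_params)
    for _ in range(iters):
        candidates = best_params + sigma * np.random.randn(samples, best_params.size)
        scores = [execute_and_score(c) for c in candidates]
        i = int(np.argmax(scores))
        if scores[i] > best_score:
            best_params, best_score = candidates[i], scores[i]
        else:
            sigma *= 0.5  # shrink the search radius when no candidate improves
    return best_params, best_score
```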
The experiments demonstrate that SCREWMIMIC can successfully learn complex bimanual manipulation behaviors, such as opening bottles, closing zippers, and stirring pots, from a single human video demonstration. The use of the screw action representation is critical, as it enables efficient exploration and fine-tuning, leading to significantly better performance compared to baselines that do not use this representation.