Key Concepts
This paper introduces the first text-guided approach for generating realistic and diverse 3D hand-object interaction sequences.
Summary
The paper proposes a novel framework for generating 3D hand-object interaction sequences from a text prompt and a canonical object mesh. The framework consists of three key components:
Contact map generation: A VAE-based network takes the text prompt and object mesh as input and generates a 3D contact map representing the probability of contact between the hand and the object surface during the interaction. This contact map serves as a strong prior for the subsequent motion generation.
Hand-object motion generation: A Transformer-based diffusion model conditions on the contact map, text features, object features, and scale information to generate physically plausible hand and object motions that align with the input text prompt. The model is trained on an augmented dataset built by manually annotating text labels on existing 3D hand and object motion datasets.
Hand refinement: A Transformer-based hand refiner module takes the generated hand-object motions and refines them to improve the temporal stability of the hand-object contacts and suppress penetration artifacts.
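The three stages above form a sequential pipeline: contact prior, then conditioned motion generation, then refinement. The following is a minimal sketch of that control flow, not the paper's implementation: the learned networks (the CVAE decoder, the diffusion denoiser, and the Transformer refiner) are replaced by toy stand-ins, and all sizes and function names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N_VERTS, T, D_POSE = 64, 8, 12  # toy sizes: object vertices, frames, pose dims

def generate_contact_map(text_emb, obj_verts):
    """Stage 1 (sketch): a CVAE decoder would map a latent sample plus text and
    object features to per-vertex contact probabilities; a random linear map
    followed by a sigmoid stands in for the learned decoder here."""
    z = rng.standard_normal(8)                      # latent sample
    feats = np.concatenate([z, text_emb])
    w = rng.standard_normal((obj_verts.shape[0], feats.shape[0])) * 0.1
    return 1.0 / (1.0 + np.exp(-(w @ feats)))       # probabilities in [0, 1]

def denoise_motion(contact_map, text_emb, steps=10):
    """Stage 2 (sketch): a diffusion model iteratively denoises a motion
    sequence conditioned on the contact map and text; a toy shrinkage step
    toward a conditioning scalar stands in for the learned denoiser."""
    x = rng.standard_normal((T, D_POSE))            # start from Gaussian noise
    cond = np.concatenate([contact_map, text_emb]).mean()
    for _ in range(steps):
        x = 0.9 * x + 0.1 * cond                    # one fake denoising step
    return x

def refine_hands(motion):
    """Stage 3 (sketch): the refiner improves temporal contact stability and
    suppresses penetration; a temporal moving average stands in here."""
    kernel = np.array([0.25, 0.5, 0.25])
    return np.stack([np.convolve(motion[:, d], kernel, mode="same")
                     for d in range(motion.shape[1])], axis=1)

text_emb = rng.standard_normal(16)                  # stand-in text-encoder output
obj_verts = rng.standard_normal((N_VERTS, 3))       # canonical object mesh vertices

cmap = generate_contact_map(text_emb, obj_verts)    # contact prior
motion = denoise_motion(cmap, text_emb)             # conditioned motion
refined = refine_hands(motion)                      # temporally smoothed result
print(refined.shape)  # (8, 12): T frames of pose parameters
```

The point of the sketch is the data flow: the contact map produced in stage 1 is consumed as conditioning in stage 2, and the refiner in stage 3 operates only on the generated motion, which is why it can be trained and applied as a separate module.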
The experiments demonstrate that the proposed framework outperforms existing baselines in accuracy, diversity, physical realism, and alignment with the input text prompts. The framework is also shown to generalize to objects unseen during training.
Statistics
"Given a text and a canonical object mesh as prompts, we generate 3D motion for hand-object interaction without requiring object trajectory and initial hand pose."
"We represent the right hand with a light skin color and the left hand with a dark skin color."
"The articulation of a box in the first row is controlled by estimating an angle for the pre-defined axis of the box."
Citations
"Hand over an apple with both hands."
"Open a box with the right hand."