The paper proposes a novel framework for generating 3D hand-object interaction sequences from a text prompt and a canonical object mesh. The framework consists of three key components:
Contact map generation: A VAE-based network takes the text prompt and object mesh as input, and generates a 3D contact map that represents the probability of contact between the hand and object surfaces during the interaction. This contact map serves as a strong prior for the subsequent motion generation.
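The summary does not give the CVAE's internals, but the notion of a per-vertex contact probability can be illustrated with a common distance-based construction (the kind of target such a network would be trained to predict). The function name, the decay constant `tau`, and the exponential form are illustrative assumptions, not details from the paper:

```python
import numpy as np

def contact_map(obj_verts, hand_verts, tau=0.01):
    """Per-vertex contact probability over an object mesh.

    Illustrative stand-in (not the paper's learned VAE): the distance
    from each object vertex to its nearest hand vertex is passed
    through a decaying exponential, so vertices touching the hand get
    values near 1 and distant vertices near 0. `tau` (in metres here)
    controls how fast the probability falls off.
    """
    # Pairwise distances between object and hand vertices: (n_obj, n_hand)
    d = np.linalg.norm(obj_verts[:, None, :] - hand_verts[None, :, :], axis=-1)
    nearest = d.min(axis=1)            # distance to the closest hand vertex
    return np.exp(-nearest / tau)      # in (0, 1]; ~1 where surfaces touch

# Toy example: 3 object vertices, 2 hand vertices
obj = np.array([[0.0, 0.0, 0.0], [0.005, 0.0, 0.0], [0.5, 0.0, 0.0]])
hand = np.array([[0.0, 0.0, 0.0], [0.1, 0.0, 0.0]])
cm = contact_map(obj, hand)
```

A map like this, predicted from text and object geometry alone, then tells the motion stage *where* on the object the hand should make contact before any motion is generated.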
Hand-object motion generation: A Transformer-based diffusion model conditions on the contact map, text features, object features, and scale information to generate physically plausible hand and object motions that align with the input text prompt. The model is trained on an augmented dataset in which text labels are manually annotated onto existing 3D hand and object motion datasets.
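The summary names the conditioning signals but not the sampling procedure. Below is a minimal sketch of standard DDPM ancestral sampling, the family of sampler such a diffusion model typically uses; the noise schedule, the toy `denoiser`, and all shapes are placeholder assumptions rather than the paper's actual network:

```python
import numpy as np

rng = np.random.default_rng(0)

# Standard linear DDPM beta schedule (assumed hyperparameters, not from the paper)
T = 50
betas = np.linspace(1e-4, 0.02, T)
alphas = 1.0 - betas
alpha_bar = np.cumprod(alphas)

def denoiser(x_t, t, cond):
    """Stand-in for the Transformer noise predictor.

    In the described framework this network would attend over the
    motion sequence and condition on contact-map, text, object, and
    scale features; here it is a fixed linear map purely so the
    sampling loop below is runnable.
    """
    return 0.1 * x_t + 0.01 * cond

def sample(cond, shape):
    """DDPM ancestral sampling: start from Gaussian noise and
    iteratively denoise toward a clean motion sequence x_0."""
    x = rng.standard_normal(shape)
    for t in reversed(range(T)):
        eps = denoiser(x, t, cond)                      # predicted noise
        coef = betas[t] / np.sqrt(1.0 - alpha_bar[t])
        x = (x - coef * eps) / np.sqrt(alphas[t])       # posterior mean
        if t > 0:                                       # re-noise except at t=0
            x += np.sqrt(betas[t]) * rng.standard_normal(shape)
    return x

cond = np.ones((1, 1))                # stands in for fused conditioning features
motion = sample(cond, shape=(60, 9))  # e.g. 60 frames of pose parameters
```

The key design point the summary highlights is that the contact map enters as a condition here, steering the denoising toward motions whose hand-object contacts match the predicted prior.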
Hand refinement: A Transformer-based hand refiner module takes the generated hand-object motions and refines them to improve the temporal stability of the hand-object contacts and suppress penetration artifacts.
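The refiner itself is a learned Transformer, whose architecture the summary does not detail. As a purely geometric illustration of the penetration-suppression goal, the sketch below projects penetrating hand vertices back onto the surface of a spherical proxy object; the sphere stand-in and the function name are assumptions for illustration only:

```python
import numpy as np

def resolve_penetration(hand_verts, center, radius):
    """Push hand vertices that fall inside a spherical proxy object
    back onto its surface.

    Illustrative only: the paper's refiner is learned, while this uses
    a sphere as a stand-in signed-distance field to show what
    "suppressing penetration" means geometrically.
    """
    offsets = hand_verts - center
    dist = np.linalg.norm(offsets, axis=-1, keepdims=True)
    inside = dist < radius                       # penetrating vertices
    # Move penetrating vertices radially outward onto the surface.
    corrected = center + offsets / np.maximum(dist, 1e-9) * radius
    return np.where(inside, corrected, hand_verts)

verts = np.array([[0.5, 0.0, 0.0],    # inside the unit sphere -> pushed out
                  [2.0, 0.0, 0.0]])   # outside -> left unchanged
fixed = resolve_penetration(verts, center=np.zeros(3), radius=1.0)
```

Applying such a correction consistently across frames is what the summary means by improving the temporal stability of contacts: a vertex in contact should stay on the surface rather than flicker in and out of the mesh.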
The experiments demonstrate that the proposed framework outperforms existing baselines in accuracy, diversity, physical realism, and alignment with the input text prompt. The framework is also shown to generalize to objects unseen during training.
Source: Junuk Cha, Ji... et al., arxiv.org, 04-02-2024, https://arxiv.org/pdf/2404.00562.pdf