The paper proposes G-HOP, a generative model that jointly captures the 3D geometry of hands and objects during interaction. Its key contributions are:
Interaction Grid Representation: The authors propose a homogeneous representation called "interaction grids" that concatenates the signed distance field of the object and a skeletal distance field of the hand. This allows the diffusion model to effectively reason about the 3D hand-object interactions.
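The grid construction can be sketched as follows. This is a simplified stand-in, not the paper's implementation: a sphere SDF substitutes for the object mesh, per-joint point distances substitute for the paper's skeletal distance field, and the resolution, joint count, and function names are assumptions for illustration.

```python
import numpy as np

def object_sdf(points, center=np.zeros(3), radius=0.5):
    # Toy signed distance field: a sphere stands in for the object mesh.
    return np.linalg.norm(points - center, axis=-1) - radius

def skeletal_distance_field(points, joints):
    # Distance from each query point to each hand joint (one channel per joint);
    # a simplified stand-in for the paper's skeletal distance field.
    diffs = points[:, None, :] - joints[None, :, :]   # (P, J, 3)
    return np.linalg.norm(diffs, axis=-1)             # (P, J)

def interaction_grid(res=16, n_joints=21):
    # Sample a res^3 lattice over [-1, 1]^3.
    lin = np.linspace(-1.0, 1.0, res)
    pts = np.stack(np.meshgrid(lin, lin, lin, indexing="ij"), axis=-1).reshape(-1, 3)
    joints = np.random.default_rng(0).uniform(-0.5, 0.5, size=(n_joints, 3))
    obj = object_sdf(pts).reshape(res, res, res, 1)
    hand = skeletal_distance_field(pts, joints).reshape(res, res, res, n_joints)
    # Concatenate object and hand channels into one homogeneous grid.
    return np.concatenate([obj, hand], axis=-1)       # (res, res, res, 1 + n_joints)

grid = interaction_grid()
print(grid.shape)  # (16, 16, 16, 22)
```

The point of the homogeneous layout is that both modalities live on the same voxel lattice, so a single 3D diffusion network can attend to hand and object jointly.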
Generative Hand-Object Prior: The authors train a denoising diffusion model on this interaction grid representation, aggregating data from 7 diverse real-world datasets spanning 155 object categories. This learned generative prior can capture plausible hand-object configurations.
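A standard denoising-diffusion training step on such grids looks like the sketch below. It assumes the usual DDPM noise-prediction objective; the linear schedule values and the trivial stand-in denoiser (the paper would use a learned 3D network) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
T = 1000
betas = np.linspace(1e-4, 0.02, T)          # assumed linear noise schedule
alphas_bar = np.cumprod(1.0 - betas)

def ddpm_training_loss(x0, predict_noise):
    # One DDPM training step: sample a timestep, noise the clean grid,
    # and score the model's noise prediction with an L2 loss.
    t = rng.integers(T)
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * eps
    return np.mean((predict_noise(x_t, t) - eps) ** 2)

# A flattened toy "interaction grid" and a trivial stand-in denoiser
# that predicts zero noise everywhere.
x0 = rng.standard_normal(16 * 16 * 16 * 22)
loss = ddpm_training_loss(x0, lambda x_t, t: np.zeros_like(x_t))
print(loss)
```

With a zero-prediction denoiser the loss is simply the mean squared noise, close to 1; a trained denoiser would drive it lower.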
Prior-Guided Inference: The authors show that the learned generative prior can guide inference for two tasks: (i) reconstructing 3D hand-object shapes from everyday interaction videos, and (ii) synthesizing plausible human grasps given an object mesh. They incorporate the prior's log-likelihood gradients into an optimization framework, combining the prior with task-specific objectives.
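The optimization scheme can be sketched as a gradient step that descends the task objective while ascending the prior's log-likelihood. Everything here is a toy stand-in: a 1D variable, a quadratic task loss, and a Gaussian score in place of the learned diffusion prior; the names and the weight `w_prior` are assumptions.

```python
import numpy as np

def guided_step(x, task_grad, prior_score, lr=0.1, w_prior=0.5):
    # Descend the task objective while ascending the prior log-likelihood:
    #   x <- x - lr * d(task)/dx + lr * w_prior * d(log p(x))/dx
    return x - lr * task_grad(x) + lr * w_prior * prior_score(x)

# Toy setup: the task pulls x toward 2.0; the stand-in Gaussian prior
# (score of N(0, 1)) pulls x toward 0. Iteration balances the two pulls.
task_grad = lambda x: 2.0 * (x - 2.0)   # gradient of (x - 2)^2
prior_score = lambda x: -x              # gradient of log N(0, 1)

x = np.array(5.0)
for _ in range(200):
    x = guided_step(x, task_grad, prior_score)
print(round(float(x), 2))  # 1.6, the fixed point where the two gradients cancel
```

The converged value 1.6 solves 2(x - 2) + 0.5x = 0: the result is neither the raw task optimum nor the prior's mode, which is exactly the trade-off the paper's guided inference exploits.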
The experiments demonstrate that the proposed generative prior outperforms task-specific baselines on both interaction reconstruction and grasp synthesis, highlighting the benefits of jointly modeling hands and objects.
Key insights distilled from: Yufei Ye, Abh... (arxiv.org, 04-19-2024)
https://arxiv.org/pdf/2404.12383.pdf