The paper proposes HandDiff, a diffusion-based model for 3D hand pose estimation. The key highlights are:
HandDiff takes depth images and point clouds as input and uses a diffusion process to iteratively refine the 3D hand pose.
It introduces a joint-wise condition extraction module to capture individual joint features, and a local feature-conditioned denoiser to leverage detailed observations around each joint.
The denoiser also incorporates a kinematic correspondence-aware aggregation block to model the dependencies between joints, further enhancing the estimation accuracy.
Extensive experiments on four challenging benchmarks, including single-hand datasets (ICVL, MSRA, NYU) and a hand-object interaction dataset (DexYCB), demonstrate that HandDiff outperforms previous state-of-the-art methods by a significant margin.
Ablation studies validate the effectiveness of the proposed components, including the joint-wise conditions, local features, and kinematic correspondence modeling.
HandDiff reaches state-of-the-art performance using only a small number of denoising steps together with multiple hypotheses, which keeps inference efficient.
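To make the idea of iterative refinement with a few denoising steps and multiple hypotheses concrete, here is a minimal sketch of a generic DDIM-style sampling loop over 3D joint coordinates. It is not the authors' implementation. The function names (`make_alpha_bar`, `toy_denoiser`, `sample_pose`), the linear noise schedule, the 21-joint skeleton, and the choice to average hypotheses into one pose are all assumptions made for illustration. In the paper the denoiser is a trained network conditioned on joint-wise and local features; here it is replaced by a toy function that returns a slightly perturbed copy of a given coarse pose.

```python
import numpy as np


def make_alpha_bar(num_train_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention terms for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_train_steps)
    return np.cumprod(1.0 - betas)


def toy_denoiser(noisy_pose, t, condition):
    """Hypothetical stand-in for a trained denoiser.

    Returns a prediction of the clean pose. A real model would use image and
    point-cloud features; this toy version just stays close to `condition`.
    """
    return condition + 0.01 * noisy_pose


def sample_pose(denoiser, condition, alpha_bar, num_steps=5, num_hypotheses=4, seed=0):
    """Run a deterministic DDIM-style reverse process for several hypotheses.

    condition: (num_joints, 3) array that the denoiser is conditioned on.
    Returns the averaged pose and the individual hypotheses.
    """
    rng = np.random.default_rng(seed)
    num_joints = condition.shape[0]
    # A short, evenly spaced subset of the training timesteps, high to low.
    timesteps = np.linspace(len(alpha_bar) - 1, 0, num_steps).round().astype(int)
    # Each hypothesis starts from independent Gaussian noise.
    x = rng.standard_normal((num_hypotheses, num_joints, 3))
    for i, t in enumerate(timesteps):
        x0_pred = np.stack([denoiser(x[k], t, condition) for k in range(num_hypotheses)])
        if i == len(timesteps) - 1:
            x = x0_pred  # final step: keep the clean-pose prediction
            break
        a_t = alpha_bar[t]
        a_prev = alpha_bar[timesteps[i + 1]]
        # Noise implied by the current sample and the clean-pose prediction.
        eps = (x - np.sqrt(a_t) * x0_pred) / np.sqrt(1.0 - a_t)
        # Move to the next (less noisy) timestep.
        x = np.sqrt(a_prev) * x0_pred + np.sqrt(1.0 - a_prev) * eps
    # Assumed aggregation rule: average the hypotheses joint by joint.
    return x.mean(axis=0), x


coarse_pose = np.linspace(-1.0, 1.0, 21 * 3).reshape(21, 3)  # illustrative 21-joint pose
pose, hypotheses = sample_pose(toy_denoiser, coarse_pose, make_alpha_bar())
print(pose.shape, hypotheses.shape)
```

Raising `num_steps` or `num_hypotheses` multiplies the number of denoiser calls, which is why reaching good accuracy with small values of both matters for inference cost.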
Key insights distilled from
by Wencan Cheng... at arxiv.org 04-05-2024
https://arxiv.org/pdf/2404.03159.pdf