Key Concepts
D-PoSE is a novel, lightweight method for estimating 3D human pose and shape from a single RGB image, achieving state-of-the-art accuracy by leveraging depth information learned from synthetic datasets as an intermediate representation.
Statistics
D-PoSE achieves state-of-the-art performance on 3DPW, reducing PA-MPJPE by 3.0mm, MPJPE by 3.1mm, and MVE by 3.6mm compared to BEDLAM-CLIFF (HRNet backbone).
On EMDB, D-PoSE reduces PA-MPJPE by 8.1mm, MPJPE by 11.6mm, and MVE by 14.2mm compared to BEDLAM-CLIFF (HRNet backbone).
Compared to TokenHMR (ViT backbone), D-PoSE reduces PA-MPJPE by 0.4mm, MPJPE by 2.7mm, and MVE by 4.3mm on 3DPW.
D-PoSE has 83.8% fewer parameters than TokenHMR and 82% fewer than HMR2.0.
The ablation study shows that using depth as an intermediate representation improves PA-MPJPE by 0.7mm on 3DPW.
On the RICH dataset, using depth reduces PA-MPJPE by 2.3mm, MPJPE by 3.6mm, and MVE by 4.3mm.
On EMDB, using depth reduces PA-MPJPE by 0.3mm, MPJPE by 2.1mm, and MVE by 2.8mm.
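The figures above are reported in three error metrics: MPJPE (mean per-joint position error), PA-MPJPE (the same error after Procrustes alignment), and MVE (mean vertex error). The following is a minimal NumPy sketch of how these metrics are commonly computed, not the authors' evaluation code. The function names, the 24-joint array shape, and the synthetic test data are assumptions made for illustration.

```python
import numpy as np


def mpjpe(pred, gt):
    """Mean Euclidean distance between matching points, both of shape (J, 3).

    Applied to mesh vertices instead of joints, the same formula gives MVE.
    The result is in the units of the inputs (millimetres in the summary).
    """
    return float(np.linalg.norm(pred - gt, axis=-1).mean())


def procrustes_align(pred, gt):
    """Align pred to gt with the best-fit scale, rotation, and translation."""
    mu_p, mu_g = pred.mean(axis=0), gt.mean(axis=0)
    p, g = pred - mu_p, gt - mu_g

    # The optimal rotation comes from the SVD of the 3x3 cross-covariance.
    U, S, Vt = np.linalg.svd(p.T @ g)

    # Flip the last axis if needed so the result is a rotation, not a reflection.
    d = np.sign(np.linalg.det(Vt.T @ U.T))
    D = np.diag([1.0, 1.0, d])
    R = Vt.T @ D @ U.T

    # Least-squares optimal isotropic scale.
    scale = (S * np.diag(D)).sum() / (p ** 2).sum()

    return scale * (p @ R.T) + mu_g


def pa_mpjpe(pred, gt):
    """MPJPE after removing global scale, rotation, and translation."""
    return mpjpe(procrustes_align(pred, gt), gt)


if __name__ == "__main__":
    # Synthetic example: the prediction is a scaled, rotated, shifted copy of gt.
    rng = np.random.default_rng(0)
    gt = rng.normal(size=(24, 3))  # 24 joints is an assumption for illustration
    theta = 0.3
    Rz = np.array([
        [np.cos(theta), -np.sin(theta), 0.0],
        [np.sin(theta), np.cos(theta), 0.0],
        [0.0, 0.0, 1.0],
    ])
    pred = 1.5 * gt @ Rz.T + np.array([0.1, -0.2, 0.3])
    print("MPJPE:   ", mpjpe(pred, gt))
    print("PA-MPJPE:", pa_mpjpe(pred, gt))
```

Because PA-MPJPE factors out global scale, rotation, and translation, it isolates errors in articulated pose, while MPJPE and MVE also penalize errors in global orientation and placement. This is one reason the three metrics can improve by different amounts on the same dataset.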
Citater
"We demonstrate that the use of depth information as an intermediate representation together with part segmentation on a simple CNN backbone suffices to deliver state of the art results in terms of both accuracy and model size."
"Despite its simple lightweight design and the CNN backbone, it outperforms ViT-based models that have a number of parameters that is larger by almost an order of magnitude."