The authors propose a self-supervised pre-training approach for 3D human understanding tasks. The key idea is to leverage pairs of images depicting the same person from different viewpoints (cross-view) or in different poses (cross-pose) to learn priors about 3D human structure and motion.
Pre-training uses a masked image modeling objective: patches of the first image in a pair are masked, and the model is trained to reconstruct them from that image's remaining visible patches together with the second image. Completing across views encourages the model to exploit geometric cues, while completing across poses encourages it to exploit motion cues, so it learns to reason about the 3D human body.
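The following is a minimal PyTorch sketch of this objective, assuming ViT-style patch tokens. The module names, sizes, masking scheme, and architecture are hypothetical illustrations of the data flow described above, not the authors' implementation.

```python
import torch
import torch.nn as nn


def take(x, idx):
    """Gather rows of x (B, N, D) at per-sample indices idx (B, K)."""
    return torch.gather(x, 1, idx.unsqueeze(-1).expand(-1, -1, x.shape[-1]))


class CrossViewCompletion(nn.Module):
    """Mask view 1, encode both views, reconstruct the masked patches
    of view 1 conditioned on view 2 (hypothetical sketch)."""

    def __init__(self, dim=256, n_patches=196, patch_pixels=16 * 16 * 3,
                 mask_ratio=0.9):
        super().__init__()
        self.mask_ratio = mask_ratio
        self.embed = nn.Linear(patch_pixels, dim)            # patch -> token
        self.pos = nn.Parameter(torch.zeros(1, n_patches, dim))
        enc = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=4)
        dec = nn.TransformerDecoderLayer(dim, nhead=8, batch_first=True)
        self.decoder = nn.TransformerDecoder(dec, num_layers=2)
        self.mask_token = nn.Parameter(torch.zeros(1, 1, dim))
        self.head = nn.Linear(dim, patch_pixels)             # token -> pixels

    def forward(self, patches1, patches2):
        """patches1, patches2: (B, N, patch_pixels) patchified image pair."""
        B, N, _ = patches1.shape
        n_keep = int(N * (1 - self.mask_ratio))
        # random per-sample mask over view 1's patches
        perm = torch.rand(B, N, device=patches1.device).argsort(dim=1)
        keep, drop = perm[:, :n_keep], perm[:, n_keep:]
        pos = self.pos.expand(B, -1, -1)

        # encode the visible tokens of view 1 and all tokens of view 2
        tok1 = self.encoder(self.embed(take(patches1, keep)) + take(pos, keep))
        tok2 = self.encoder(self.embed(patches2) + pos)

        # mask tokens (carrying their positions) cross-attend to both views
        queries = self.mask_token + take(pos, drop)
        pred = self.head(self.decoder(queries, torch.cat([tok1, tok2], dim=1)))

        # pixel reconstruction loss on the masked patches only
        return nn.functional.mse_loss(pred, take(patches1, drop))
```

In use, a pair would be patchified and passed as `loss = model(patchify(img1), patchify(img2))`; the high mask ratio forces the decoder to rely on the second view rather than on local interpolation within the first.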
The authors pre-train two models, one focused on full-body tasks (CroCo-Body) and one on hand-centric tasks (CroCo-Hand), using a diverse set of multi-view and video datasets. They then fine-tune these pre-trained models on a range of downstream tasks, including body and hand mesh recovery, dense pose estimation, gesture classification, and grasp recognition.
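A hedged sketch of the fine-tuning step, continuing the hypothetical code above: the completion decoder is discarded, the pre-trained encoder is kept, and a freshly initialized task head is attached (a gesture classifier is shown as one example; the class and its layout are assumptions, not the authors' code).

```python
import torch.nn as nn


class GestureClassifier(nn.Module):
    """Reuses the pre-trained encoder from the sketch above for a
    downstream task; only the head is newly initialized."""

    def __init__(self, pretrained, num_classes, dim=256):
        super().__init__()
        self.embed = pretrained.embed         # pre-trained patch embedding
        self.pos = pretrained.pos             # pre-trained position codes
        self.encoder = pretrained.encoder     # pre-trained encoder weights
        self.head = nn.Linear(dim, num_classes)  # randomly initialized

    def forward(self, patches):               # one unmasked image per sample
        tokens = self.encoder(self.embed(patches) + self.pos)
        return self.head(tokens.mean(dim=1))  # mean-pool tokens, classify
```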
The results show that the proposed self-supervised pre-training outperforms random initialization, ImageNet pre-training, and other self-supervised methods when fine-tuned on the target human-centric tasks. The authors also demonstrate the data efficiency of their approach: the pre-trained models achieve strong performance even with limited fine-tuning data.