Core Concepts
A self-supervised approach, called SelfPose3d, is proposed to estimate 3D poses of multiple persons from multi-view images without using any 2D or 3D ground-truth poses. The approach combines synthetic 3D root points, differentiable rendering of 2D joints and heatmaps across affine-transformed views, and an adaptive supervision attention mechanism to learn 3D poses in a self-supervised manner.
Abstract
The paper presents a self-supervised approach, called SelfPose3d, for estimating 3D poses of multiple persons from multi-view images. Unlike current fully-supervised methods that require 2D or 3D ground-truth poses, SelfPose3d only uses multi-view input images and pseudo 2D poses generated from an off-the-shelf 2D human pose estimator.
The key components of the approach are:
Self-supervised 3D root localization: A synthetic dataset of 3D root points and their corresponding multi-view root heatmaps is used to train a 3D root localization model. This model is further regularized using an affine consistency constraint.
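The synthetic-data step above can be sketched as follows: sample random 3D root points in a capture volume, project them into each camera view, and render per-view root heatmaps. This is a minimal numpy sketch; the volume bounds, heatmap size, Gaussian sigma, and pinhole camera model are illustrative assumptions, not the paper's actual configuration.

```python
import numpy as np

def random_root_points(n_persons, bounds=((-2.0, 2.0), (-2.0, 2.0), (0.8, 1.2))):
    """Sample synthetic 3D root (pelvis) positions inside an assumed capture volume."""
    lo = np.array([b[0] for b in bounds])
    hi = np.array([b[1] for b in bounds])
    return lo + np.random.rand(n_persons, 3) * (hi - lo)

def project(points_3d, P):
    """Pinhole projection of Nx3 world points with a 3x4 camera matrix."""
    homo = np.hstack([points_3d, np.ones((len(points_3d), 1))])
    uv = (P @ homo.T).T
    return uv[:, :2] / uv[:, 2:3]

def render_root_heatmap(points_2d, size=(64, 64), sigma=2.0):
    """Render a single-channel root heatmap as the max over per-person Gaussians."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    hm = np.zeros(size)
    for (u, v) in points_2d:
        hm = np.maximum(hm, np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2)))
    return hm
```

Pairs of (multi-view heatmaps, 3D points) generated this way give full supervision for the root localizer without any real annotations.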
Self-supervised 3D pose estimation: The 3D poses are learned as a bottleneck representation. These 3D poses are projected to 2D joints in each view, which are then rendered into differentiable 2D heatmap representations. The model is trained to minimize the L1 loss between the projected 2D joints and the pseudo 2D poses, as well as the L2 loss between the rendered heatmaps and the heatmaps of the pseudo 2D poses.
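The project-then-render loop of the pose-estimation component can be sketched as below. This is a hedged numpy illustration, not the paper's implementation: the pinhole camera model, heatmap size, and Gaussian sigma are assumptions, and a real training loop would use an autodiff framework so gradients flow back through the rendering to the bottleneck 3D poses.

```python
import numpy as np

def project_joints(pose_3d, P):
    """Project Jx3 3D joints to 2D with a 3x4 camera matrix (pinhole model)."""
    homo = np.hstack([pose_3d, np.ones((len(pose_3d), 1))])
    uv = (P @ homo.T).T
    return uv[:, :2] / uv[:, 2:3]

def render_heatmaps(joints_2d, size=(64, 64), sigma=2.0):
    """Differentiable-style rendering: each joint becomes a 2D Gaussian heatmap."""
    ys, xs = np.mgrid[0:size[0], 0:size[1]]
    return np.stack([
        np.exp(-((xs - u) ** 2 + (ys - v) ** 2) / (2 * sigma ** 2))
        for (u, v) in joints_2d
    ])

def pose_losses(pose_3d, cameras, pseudo_joints, pseudo_heatmaps):
    """Sum over views: L1 on projected joints + L2 on rendered heatmaps."""
    l1 = l2 = 0.0
    for P, pj, ph in zip(cameras, pseudo_joints, pseudo_heatmaps):
        proj = project_joints(pose_3d, P)
        l1 += np.abs(proj - pj).mean()
        l2 += ((render_heatmaps(proj) - ph) ** 2).mean()
    return l1, l2
```

Because both losses are computed per view and summed, the single 3D pose must explain every camera simultaneously, which is what makes the bottleneck representation geometrically meaningful.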
Adaptive supervision attention: To address the inaccuracies in the pseudo 2D poses, an adaptive supervision attention mechanism is proposed. For the L1 joint loss, a hard attention strategy is used to ignore the view with the largest error. For the L2 heatmap loss, a soft attention mechanism is employed using a lightweight backbone to generate attention heatmaps.
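The two attention strategies above can be illustrated with a small sketch. The helper names, the per-view error reduction, and the fixed attention maps are assumptions for illustration; in the paper the soft-attention maps are produced by a lightweight backbone rather than supplied directly.

```python
import numpy as np

def hard_attention_l1(proj_joints_per_view, pseudo_per_view):
    """Hard attention for the joint loss: compute the per-view L1 error
    against the pseudo 2D poses and drop the view with the largest error,
    so one badly occluded or mis-detected view cannot dominate training."""
    errors = np.array([np.abs(p - q).mean()
                       for p, q in zip(proj_joints_per_view, pseudo_per_view)])
    keep = np.ones(len(errors), dtype=bool)
    keep[errors.argmax()] = False
    return errors[keep].mean()

def soft_attention_l2(rendered, pseudo, attention):
    """Soft attention for the heatmap loss: per-pixel attention maps weight
    the squared error so unreliable regions contribute less."""
    return (attention * (rendered - pseudo) ** 2).mean()
```

The hard variant is a winner-takes-nothing rule over views, while the soft variant down-weights supervision continuously at the pixel level.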
Extensive experiments on the Panoptic, Shelf, and Campus datasets show that SelfPose3d achieves performance comparable to fully-supervised approaches, while significantly outperforming optimization-based methods that do not use ground-truth poses. Qualitative results demonstrate the ability of SelfPose3d to handle occlusions and multiple persons, as well as the plausibility of the estimated 3D poses and body meshes.
Stats
The 3D root localization model is trained on a synthetic dataset of randomly placed 3D points and their corresponding multi-view root heatmaps.
Quotes
"We propose a novel self-supervised learning objective that aims to recover 2d joints and heatmaps under different affine transformations from the bottleneck 3d poses."
"To address the inaccuracies in the pseudo 2d poses, we propose an adaptive supervision attention mechanism."