Core Concepts
A self-supervised learning-based approach for multi-person 3D pose estimation from multiple RGB cameras without requiring annotated datasets.
Abstract
The paper presents a novel approach for multi-person 3D pose estimation from multiple RGB cameras. The key highlights are:
The system consists of a three-stage pipeline: 2D skeleton detection, cross-view skeleton matching using a Graph Neural Network (GNN), and 3D pose estimation using a Multi-Layer Perceptron (MLP).
The cross-view skeleton matching and 3D pose estimation components are trained in a self-supervised manner, avoiding the need for annotated datasets with 3D ground truth poses.
Cross-view matching is formulated as a binary classification task for the GNN, which is trained on recordings of single individuals moving through the environment. The 3D pose estimator is trained by minimizing the reprojection error between the estimated 3D poses and the detected 2D keypoints; simplified sketches of both components follow this list.
Experiments on the CMU Panoptic dataset and a custom ARP Laboratory dataset show that the proposed approach runs faster than state-of-the-art methods with comparable accuracy, without requiring annotated datasets.
The system is also evaluated on a mobile robot setup, demonstrating that it can operate with a reduced number of cameras at inference time while still leveraging additional cameras during training to improve performance.
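To make the matching formulation concrete, below is a minimal sketch assuming a generic pairwise classifier in place of the paper's GNN (whose architecture is not detailed here): classifier scores fill a cross-view affinity matrix, and the Hungarian algorithm (via scipy) resolves the assignment. The names `pair_score` and `match_across_views` are hypothetical, and the toy distance-based score stands in for the learned model.

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def pair_score(skel_a: np.ndarray, skel_b: np.ndarray) -> float:
    # Stand-in for the paper's GNN classifier: returns a pseudo-probability
    # that two 2D skeletons (J x 2 keypoint arrays) from different views
    # belong to the same person. This toy score uses mean keypoint distance
    # purely for illustration; the real model learns this from data.
    d = np.linalg.norm(skel_a - skel_b, axis=1).mean()
    return float(np.exp(-d / 100.0))  # higher score = more likely a match

def match_across_views(skels_view1, skels_view2, threshold=0.5):
    # Fill an affinity matrix with pairwise scores, then solve the
    # assignment with the Hungarian algorithm (maximizing total score).
    affinity = np.array([[pair_score(a, b) for b in skels_view2]
                         for a in skels_view1])
    rows, cols = linear_sum_assignment(-affinity)
    # Discard low-confidence pairs so unmatched people stay unmatched.
    return [(i, j) for i, j in zip(rows, cols) if affinity[i, j] >= threshold]
```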
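The self-supervised training signal for the 3D stage can likewise be sketched as a reprojection loss, assuming pinhole cameras with known 3x4 projection matrices. The MLP that produces `pose_3d` is omitted, the paper's exact loss terms and weighting may differ, and the helper names (`project`, `reprojection_loss`) are ours.

```python
import torch

def project(points_3d: torch.Tensor, P: torch.Tensor) -> torch.Tensor:
    # Project J x 3 world points with a 3 x 4 pinhole camera matrix P,
    # returning J x 2 pixel coordinates.
    homo = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=1)
    uvw = homo @ P.T
    return uvw[:, :2] / uvw[:, 2:3]

def reprojection_loss(pose_3d, keypoints_2d, cameras, visibility):
    # Self-supervised objective: distance between the reprojected 3D pose
    # and the detected 2D keypoints, averaged over visible joints and cameras.
    # pose_3d: J x 3 (output of the pose network), keypoints_2d: C x J x 2,
    # cameras: list of C projection matrices, visibility: C x J bool mask.
    loss = pose_3d.new_zeros(())
    for c, P in enumerate(cameras):
        per_joint = torch.linalg.norm(project(pose_3d, P) - keypoints_2d[c], dim=1)
        loss = loss + per_joint[visibility[c]].mean()
    return loss / len(cameras)
```

Minimizing a loss of this form with gradient descent updates the pose network using only detected 2D keypoints, which is what removes the need for 3D ground truth.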
Stats
The system takes as input 2D keypoint detections obtained from an existing 2D pose detector. The authors report the following key metrics (a sketch of how MPJPE is computed follows the list):
Mean per joint position error (MPJPE): 19.77 mm
Mean average precision (mAP): 95.67%
Mean recall (mR): 97.39%
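For reference, MPJPE is the mean Euclidean distance between predicted and ground-truth 3D joint positions. A minimal sketch follows; the `mpjpe` helper is ours, and evaluation protocols sometimes additionally align poses (e.g., root-centering) before measuring.

```python
import numpy as np

def mpjpe(pred: np.ndarray, gt: np.ndarray) -> float:
    # Mean Per Joint Position Error: average Euclidean distance between
    # predicted and ground-truth joints, in the input units (mm here).
    # pred, gt: ... x J x 3 arrays of 3D joint positions.
    return float(np.linalg.norm(pred - gt, axis=-1).mean())
```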
Quotes
"This is the first multi-camera, multi-person data-driven approach that does not require an annotated dataset."
"To deal with these problems and avoid the need for annotated datasets, we propose a self-supervised learning-based solution."