Core Concepts
Existing 3D human pose and shape estimation methods struggle to accurately predict the global 3D position of humans, particularly in video surveillance scenarios with varying camera perspectives and crowded scenes. The proposed RotAvat approach addresses this limitation by aligning the predicted 3D meshes with the ground plane, improving the overall accuracy of 3D human pose and shape estimation in such real-world settings.
Abstract
The paper investigates the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios, where a stationary camera captures either a single RGB image or a video stream containing multiple pedestrians.
The authors argue that standard metrics, such as MPJPE (Mean Per Joint Position Error) and PVE (Per Vertex Error), do not adequately assess the performance of 3D human pose and shape estimation methods in these real-world scenarios, as they focus primarily on local pose and shape accuracy rather than on the global 3D position of the humans. To address this, the authors advocate the W-MPJPE and W-PVE metrics, which compute the error directly in world coordinates without any realignment step.
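The distinction between the world-frame and conventional metrics can be sketched as follows (a minimal NumPy illustration; the function names and the pelvis-root convention are assumptions, not the paper's code):

```python
import numpy as np

def w_mpjpe(pred_joints, gt_joints):
    """World-frame MPJPE: mean Euclidean distance per joint, computed
    directly in world coordinates with no realignment step."""
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def mpjpe_root_aligned(pred_joints, gt_joints, root=0):
    """Conventional MPJPE after root (e.g. pelvis) alignment, which
    discards any global translation error."""
    pred = pred_joints - pred_joints[:, root:root + 1]
    gt = gt_joints - gt_joints[:, root:root + 1]
    return np.linalg.norm(pred - gt, axis=-1).mean()
```

A prediction that is locally perfect but globally misplaced by one metre scores zero under the root-aligned metric while W-MPJPE reports the full metre of error, which is exactly the failure mode the authors highlight.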
The paper then introduces the RotAvat approach, a deep-learning-free post-processing step that aims to improve the alignment of the predicted 3D meshes with the ground plane. RotAvat rests on two key assumptions: that pedestrians walk on flat ground and that they stand upright. The approach consists of two main steps:
Auto-calibration: This step regresses the camera calibration parameters from the predicted 2D foot-head pairs of each pedestrian, enabling the estimation of the ground plane.
RotAvat: This step transforms each 3D mesh by rotation, translation, and scaling to align and straighten the meshes on the ground, while ensuring that the adjustments do not significantly alter the results from the camera's point of view.
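The rotate-and-translate part of the second step can be sketched as follows. This is an illustrative reconstruction, not the paper's implementation: the mesh's local up axis (+y) is an assumption, and the scaling that preserves the camera-view projection is omitted.

```python
import numpy as np

def align_to_ground(vertices, ground_normal, ground_point):
    """Rotate a mesh so it stands upright relative to the ground plane,
    then translate it so its lowest vertices (the feet) rest on the plane.
    `vertices` is (N, 3); the mesh's up axis is assumed to be +y."""
    up = np.array([0.0, 1.0, 0.0])  # assumed local up direction
    n = ground_normal / np.linalg.norm(ground_normal)
    # Rotation taking `up` onto the plane normal (Rodrigues' formula).
    v = np.cross(up, n)
    c = float(np.dot(up, n))
    if np.linalg.norm(v) < 1e-9:
        # up and n parallel; antiparallel case: 180 deg turn about x.
        R = np.eye(3) if c > 0 else np.diag([1.0, -1.0, -1.0])
    else:
        K = np.array([[0.0, -v[2], v[1]],
                      [v[2], 0.0, -v[0]],
                      [-v[1], v[0], 0.0]])
        R = np.eye(3) + K + K @ K * (1.0 - c) / np.dot(v, v)
    # Rotate about the mesh centroid so the mesh straightens in place.
    centroid = vertices.mean(axis=0)
    rotated = (vertices - centroid) @ R.T + centroid
    # Translate along the normal so the lowest vertex touches the plane.
    heights = (rotated - ground_point) @ n
    return rotated - heights.min() * n
```

In the full method this adjustment is constrained so that, from the camera's point of view, the reprojected mesh changes as little as possible.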
Through qualitative comparisons, the authors demonstrate the effectiveness of RotAvat in addressing the limitations of existing methods, particularly in terms of accurately positioning the 3D meshes with respect to the ground plane. The proposed approach can be applied as a post-processing step to any existing 3D human pose and shape estimation solution, making it a versatile and practical enhancement for video surveillance scenarios.
Stats
The average height of a pedestrian is assumed to be approximately 170 cm.
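Under a pinhole camera model, this assumed height lets one convert a pedestrian's apparent pixel height into a depth estimate via the standard relation z = f · H / h. The sketch below is illustrative (the function name and the focal-length-in-pixels input are assumptions, and it presumes the person is roughly upright, fronto-parallel, and fully visible):

```python
def depth_from_height(f_pixels, h_image_pixels, h_real_m=1.70):
    """Pinhole relation: a pedestrian of real height H (metres), imaged
    with apparent height h (pixels) by a camera of focal length f
    (pixels), lies at depth z = f * H / h from the camera."""
    return f_pixels * h_real_m / h_image_pixels
```

With f = 1000 px, a person spanning 100 px in the image would be placed roughly 17 m away; repeating this over many detected foot-head pairs is what makes the ground plane recoverable from a single stationary view.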
Quotes
"The main difficulty stems from the scarcity of sufficient multi-person data with accurate 3D translation annotations for supervision in diverse environments with varying camera perspectives, necessary for purely deep learning-based HPS methods."
"Until such an approach can be trained to convincingly generalize to any scene, an alternative approach (which we advocate here) is the use of global 3D data in a deep-learning free post-processing step."