The paper investigates the challenges of accurately estimating 3D human pose and shape in video surveillance scenarios, where a stationary camera captures either a single RGB image or a video stream containing multiple pedestrians.
The authors argue that existing metrics, such as MPJPE and PVE, do not adequately assess the performance of 3D human pose and shape estimation methods in these real-world scenarios, as they focus primarily on local pose and shape accuracy rather than the global 3D position of the humans. To address this, the authors advocate for the use of the W-MPJPE and W-PVE metrics, which compute the error in world coordinates without any realignment step.
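For concreteness, a minimal sketch of how such world-coordinate metrics could be computed is shown below. The function names and array shapes are illustrative assumptions, not taken from the paper; the key point is simply that no root translation or Procrustes realignment is applied before measuring the error.

```python
import numpy as np

def w_mpjpe(pred_joints, gt_joints):
    """W-MPJPE: mean per-joint position error in world coordinates.

    Unlike root-aligned MPJPE or Procrustes-aligned PA-MPJPE, no realignment
    is applied, so errors in the global 3D position of each person are fully
    reflected in the score.
    pred_joints, gt_joints: (N, J, 3) arrays of joint positions in metres,
    expressed in the world frame.
    """
    return np.linalg.norm(pred_joints - gt_joints, axis=-1).mean()

def w_pve(pred_verts, gt_verts):
    """W-PVE: per-vertex error of the body mesh, also without realignment."""
    return np.linalg.norm(pred_verts - gt_verts, axis=-1).mean()
```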
The paper then introduces RotAvat, a deep-learning-free post-processing step that improves the alignment of the predicted 3D meshes with the ground plane. RotAvat rests on two key assumptions: pedestrians walk on flat ground and stand upright. The approach consists of two main steps:
Auto-calibration: This step regresses the camera calibration parameters from the predicted 2D foot-head pairs of the pedestrians, which in turn yields an estimate of the ground plane (see the first sketch after this list).
RotAvat: This step transforms each 3D mesh by a rotation, translation, and scaling so that it stands upright on the estimated ground plane, while keeping the adjustment nearly invisible from the camera's point of view (see the second sketch after this list).
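The paper's exact regression is not reproduced here, but the underlying geometry of the auto-calibration step can be illustrated with a classic horizon-from-pedestrians fit: assuming people of roughly equal height standing upright on flat ground, the image row of the head is an affine function of the image row of the feet, and the horizon (hence the camera tilt) follows from a least-squares fit. The function below is a hypothetical sketch under these assumptions, not the authors' method.

```python
import numpy as np

def estimate_horizon_and_pitch(foot_y, head_y, focal, cy):
    """Rough auto-calibration from 2D foot-head pairs (illustrative only).

    Assumes pedestrians of roughly equal height standing upright on a flat
    ground plane viewed by a fixed pinhole camera. Under these assumptions
    head_y ~ a * foot_y + b, and the horizon row is where head and foot
    coincide, y = b / (1 - a).
    foot_y, head_y: image rows (pixels) of feet and heads, one pair per person.
    focal, cy: focal length and principal-point row in pixels (assumed known).
    Returns the horizon row and the camera pitch angle in radians.
    """
    A = np.stack([foot_y, np.ones_like(foot_y)], axis=1)
    (a, b), *_ = np.linalg.lstsq(A, head_y, rcond=None)
    horizon_y = b / (1.0 - a)
    # horizon above the principal point means the camera is tilted downwards
    pitch = np.arctan((cy - horizon_y) / focal)
    return horizon_y, pitch
```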
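Likewise, the following is only a hedged sketch of the alignment idea rather than the paper's implementation: the mesh is rotated about its root so that its up direction matches the ground-plane normal, then rescaled about the camera centre until its lowest vertex touches the ground plane. All names, and the choice of the mesh centroid as the root, are assumptions made for illustration.

```python
import numpy as np

def rotation_between(u, v):
    """Rotation matrix taking unit vector u onto unit vector v (Rodrigues)."""
    u, v = u / np.linalg.norm(u), v / np.linalg.norm(v)
    axis = np.cross(u, v)
    s, c = np.linalg.norm(axis), np.dot(u, v)
    if s < 1e-8:
        # already (nearly) aligned; the antiparallel case is not expected
        # for people assumed to be standing upright
        return np.eye(3)
    K = np.array([[0, -axis[2], axis[1]],
                  [axis[2], 0, -axis[0]],
                  [-axis[1], axis[0], 0]])
    return np.eye(3) + K + K @ K * ((1 - c) / s**2)

def align_mesh_to_ground(verts, up_dir, ground_normal, ground_d):
    """Straighten a predicted mesh and drop it onto the ground plane.

    verts: (V, 3) mesh vertices in camera coordinates (camera at the origin).
    up_dir: the mesh's current up direction (e.g. a pelvis-to-neck vector).
    Ground plane: {x : ground_normal . x = ground_d}, with an upward normal.
    """
    root = verts.mean(axis=0)
    # 1) rotate about the root so the person stands upright on the plane
    R = rotation_between(up_dir, ground_normal)
    verts = (verts - root) @ R.T + root
    # 2) scale about the camera centre so the lowest vertex reaches the ground
    foot = verts[np.argmin(verts @ ground_normal)]
    s = ground_d / np.dot(ground_normal, foot)
    return verts * s
```

Scaling about the camera centre is what keeps the correction nearly invisible in the image: for a pinhole camera, the projection u = f * x / z is unchanged when x and z are multiplied by the same factor, so only the small rotation affects the rendered view.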
Through qualitative comparisons, the authors demonstrate the effectiveness of RotAvat in addressing the limitations of existing methods, particularly in terms of accurately positioning the 3D meshes with respect to the ground plane. The proposed approach can be applied as a post-processing step to any existing 3D human pose and shape estimation solution, making it a versatile and practical enhancement for video surveillance scenarios.
Source: https://arxiv.org/pdf/2409.16861.pdf (arxiv.org, 09-26-2024)