The proposed Decoupled Space-Time Aggregation (DSTA) network efficiently models the spatial and temporal dependencies of human pose joints in video sequences, outperforming previous regression-based methods and achieving performance on par with state-of-the-art heatmap-based methods.
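To make the idea of decoupling concrete, the following is a minimal sketch in NumPy. It is not the DSTA architecture. It only illustrates handling spatial and temporal dependencies in two separate steps over per-joint features. The function name `decoupled_aggregate`, the fixed mixing matrices `spatial_w` and `temporal_w`, and the choice to sum the two branches are all assumptions made for illustration.

```python
import numpy as np

def decoupled_aggregate(feats, spatial_w, temporal_w):
    """Illustrative only; names and the additive fusion are assumptions.

    feats:      (T, J, C) per-joint features for T frames and J joints.
    spatial_w:  (J, J) mixing weights across joints within one frame.
    temporal_w: (T, T) mixing weights across frames for one joint.
    """
    # Spatial step: mix joints within each frame, frames stay independent.
    spatial = np.einsum("jk,tkc->tjc", spatial_w, feats)
    # Temporal step: mix frames for each joint, joints stay independent.
    temporal = np.einsum("ts,sjc->tjc", temporal_w, feats)
    # Hypothetical fusion: sum the two branches.
    return spatial + temporal
```

In this sketch the two steps cost on the order of T·J² and J·T² pairwise interactions, whereas mixing all T·J joint tokens jointly would cost (T·J)². That difference is one plausible reading of why a decoupled design can be efficient.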
The proposed Kinematics Modeling Network (KIMNet) explicitly models the temporal correlation between joints across different frames to improve the robustness and accuracy of video-based human pose estimation.
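As a rough illustration of what temporal correlation between joints can mean, the sketch below computes a hand-crafted statistic: the Pearson correlation between the frame-to-frame speeds of every pair of joints. This is not how KIMNet works, since a learned model would infer such relationships from data. The function name `joint_motion_correlation` and the (T, J, 2) input layout are assumptions.

```python
import numpy as np

def joint_motion_correlation(poses):
    """Illustrative only; not the KIMNet formulation.

    poses: (T, J, 2) array of 2D joint coordinates over T frames.
    Returns a (J, J) matrix of Pearson correlations between joint speeds.
    """
    # Frame-to-frame displacement of each joint: shape (T-1, J, 2).
    disp = np.diff(poses, axis=0)
    # Speed of each joint at each frame step: shape (T-1, J).
    speed = np.linalg.norm(disp, axis=-1)
    # np.corrcoef treats each row as one variable, so pass shape (J, T-1).
    return np.corrcoef(speed.T)
```

A joint whose speed is constant over the clip has zero variance, so its row and column in the result will be NaN. Real sequences would need that case handled.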