The authors introduce the novel task of whole-body human motion forecasting, which aims to jointly predict the future motion of the major body joints and the hand gestures. This contrasts with previous works, which forecast only the major joints of the human body and overlook the important role of hand gestures in human communication and intention expression.
To tackle this challenge, the authors propose an Encoding-Alignment-Interaction (EAI) framework. The key components are:
Intra-context Encoding: The authors extract the spatio-temporal correlations of the major body, left hand, and right hand separately, capturing each component's distinct motion patterns (see the first sketch after this list).
Cross-context Alignment (XCA): The authors introduce cross-neutralization and discrepancy constraints to alleviate the heterogeneity among the different body components, so that their features can be effectively combined (second sketch below).
Cross-context Interaction (XCI): The authors propose a variant of cross-attention that captures the semantic and physical interactions among the different body parts, allowing the coarse-grained (body) and fine-grained (gesture) properties to facilitate each other (third sketch below).
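To make the intra-context encoding step concrete, here is a minimal PyTorch sketch of per-component encoding. The layer choices (a linear joint embedding followed by a GRU over time), dimensions, and joint counts are all illustrative assumptions, not the paper's exact architecture:

```python
# Sketch 1: intra-context encoding. All layer types, dimensions, and joint
# counts below are assumptions for illustration, not the paper's design.
import torch
import torch.nn as nn

class IntraContextEncoder(nn.Module):
    """Encodes one component's motion sequence (body, left hand, or right hand)
    independently, so each component keeps its own spatio-temporal statistics."""
    def __init__(self, joint_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.spatial = nn.Linear(joint_dim, hidden_dim)            # per-frame joint mixing
        self.temporal = nn.GRU(hidden_dim, hidden_dim, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, frames, joint_dim) -> (batch, frames, hidden_dim)
        h, _ = self.temporal(self.spatial(x))
        return h

# One encoder per component, each with its own parameters.
encoders = nn.ModuleDict({
    "body":  IntraContextEncoder(joint_dim=66),   # e.g., 22 joints x 3D (assumed)
    "lhand": IntraContextEncoder(joint_dim=45),   # e.g., 15 joints x 3D (assumed)
    "rhand": IntraContextEncoder(joint_dim=45),
})
x = {k: torch.randn(2, 50, enc.spatial.in_features) for k, enc in encoders.items()}
feats = {k: encoders[k](v) for k, v in x.items()}  # each: (2, 50, 128)
```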
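The summary does not specify the exact form of the cross-neutralization and discrepancy constraints, so the following is only one plausible, hypothetical alignment term (matching per-channel feature moments across components), shown to illustrate where such a constraint would enter the training loss:

```python
# Sketch 2: a hypothetical cross-context alignment term. The paper's actual
# cross-neutralization / discrepancy constraints may be defined differently;
# moment matching is used here purely to illustrate the idea.
import torch
import torch.nn.functional as F

def moment_discrepancy(a: torch.Tensor, b: torch.Tensor) -> torch.Tensor:
    """Penalize distribution mismatch between two components' features by
    matching their per-channel mean and standard deviation.
    a, b: (batch, frames, hidden_dim)"""
    a2 = a.reshape(-1, a.size(-1))
    b2 = b.reshape(-1, b.size(-1))
    mean_gap = F.mse_loss(a2.mean(0), b2.mean(0))
    std_gap = F.mse_loss(a2.std(0), b2.std(0))
    return mean_gap + std_gap

feats = {k: torch.randn(2, 50, 128) for k in ("body", "lhand", "rhand")}  # stand-ins
align_loss = (moment_discrepancy(feats["body"], feats["lhand"]) +
              moment_discrepancy(feats["body"], feats["rhand"]))
# In training, this term would be added to the forecasting loss with some
# weight, e.g. total = prediction_loss + lambda_align * align_loss.
```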
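Finally, a minimal sketch of the interaction step: each hand queries the body features and the body queries both hands. The paper proposes a variant of cross-attention, likely with additional structure; this vanilla version only shows how one component's features can attend to another's:

```python
# Sketch 3: cross-context interaction via plain cross-attention. The paper's
# variant may differ; this only illustrates the query/context pattern.
import torch
import torch.nn as nn

class CrossContextInteraction(nn.Module):
    def __init__(self, dim: int = 128, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, q: torch.Tensor, kv: torch.Tensor) -> torch.Tensor:
        # q: features of the component being refined;
        # kv: features of the component providing context.
        out, _ = self.attn(q, kv, kv)
        return self.norm(q + out)  # residual keeps the intra-context features

xci = CrossContextInteraction()
feats = {k: torch.randn(2, 50, 128) for k in ("body", "lhand", "rhand")}
# Body attends to both hands; each hand attends to the body.
body = xci(feats["body"], torch.cat([feats["lhand"], feats["rhand"]], dim=1))
lhand = xci(feats["lhand"], feats["body"])
rhand = xci(feats["rhand"], feats["body"])
```

The refined per-component features would then feed prediction heads that decode the future poses; that decoding step is omitted here for brevity.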
The authors conduct extensive experiments on a large-scale benchmark and demonstrate that the EAI framework achieves state-of-the-art performance for both short-term and long-term whole-body motion prediction.