Core Concepts
HumMUSS is a novel attention-free architecture for human motion understanding built on state space models (SSMs). It achieves competitive performance across motion understanding tasks while offering practical benefits: adaptability to different video frame rates, faster training, and efficient sequential inference.
Abstract
The paper introduces HumMUSS, a novel attention-free architecture for human motion understanding that leverages state space models (SSMs). Key highlights:
HumMUSS consists of alternating spatial and temporal Gated Diagonal SSM (GDSSM) blocks, designed to efficiently learn rich spatiotemporal features.
HumMUSS inherits the advantages of diagonal SSMs (DSSMs), including faster training and inference on longer sequences, and constant time and memory cost per step for real-time sequential inference.
Being a continuous-time model, HumMUSS can seamlessly generalize to dynamic frame rates during inference with minimal performance degradation.
HumMUSS achieves competitive performance on 3D pose estimation, human mesh recovery, and action recognition tasks compared to state-of-the-art transformer-based methods.
The authors also introduce a fully causal version of HumMUSS that outperforms current causal models in terms of accuracy, speed, and memory efficiency, making it suitable for real-time applications.
Extensive experiments demonstrate the practical benefits of HumMUSS over transformer-based approaches, including faster training, more efficient sequential inference, and better adaptability to varying frame rates.
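The two properties above — constant-cost sequential inference and frame-rate adaptability — both follow from the diagonal, continuous-time formulation. A minimal NumPy sketch of a diagonal SSM recurrence illustrates this; it is not the paper's GDSSM implementation, and all names and dimensions here are illustrative. Because the transition matrix is diagonal, each step costs O(d) time and memory regardless of sequence length, and because the dynamics are defined in continuous time, the frame rate enters only through the timestep `dt` used at discretization, not through the learned parameters.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 8                                    # state dimension (illustrative)
A = -np.abs(rng.normal(size=d)) - 0.1    # strictly negative diagonal -> stable dynamics
B = rng.normal(size=d)                   # input projection
C = rng.normal(size=d)                   # readout projection

def step(h, x, dt):
    """One recurrent update with an explicit timestep dt.

    Changing dt (e.g. 1/30 vs 1/60 for different frame rates) only
    changes the discretization of the same continuous-time system;
    A, B, C stay fixed.
    """
    dA = np.exp(A * dt)              # zero-order-hold discretized transition
    dB = (dA - 1.0) / A * B          # corresponding discretized input matrix
    return dA * h + dB * x

# Sequential inference: the state h is the only thing carried between
# frames, so memory use is constant no matter how long the video is.
h = np.zeros(d)
for x in rng.normal(size=120):       # 120 scalar "frames" (illustrative)
    h = step(h, x, dt=1.0 / 30.0)    # 30 fps input
y = float(C @ h)                     # per-frame readout
```

A transformer, by contrast, must retain (or recompute attention over) the full window of past frames, which is why the recurrent formulation wins on memory and latency for streaming prediction.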
Stats
HumMUSS is 3.8x more memory-efficient and 11.1x faster than MotionBERT for sequential inference on 243 frames.
HumMUSS maintains high accuracy on the MPI-INF-3DHP dataset even when the input is sub-sampled at higher rates, unlike MotionBERT which sees a significant performance drop.
Quotes
"HumMUSS not only matches the performance of transformer-based models in various motion understanding tasks but also brings added benefits like adaptability to different video frame rates and enhanced training speed when working with longer sequences of keypoints."
"For real-time sequential prediction, our model is both memory efficient and several times faster than transformer-based approaches while maintaining their high accuracy."