The paper introduces the Bidirectional Autoregressive Motion Model (BAMM), a novel text-to-motion generation framework. BAMM consists of two key components:
Motion Tokenizer: This component encodes raw 3D human motion into discrete tokens in a latent space using a Vector Quantized Variational Autoencoder (VQ-VAE); a minimal quantization sketch appears below.
Conditional Masked Self-attention Transformer: This component learns to autoregressively predict randomly masked motion tokens through a hybrid attention masking strategy, using both unidirectional (causal) and bidirectional attention masks to capture rich dependencies among motion tokens and to enable dynamic prediction of the motion sequence length; the second sketch below illustrates the two mask types.
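For the first component, the sketch below shows, under stated assumptions, how a VQ-VAE-style quantizer turns continuous motion latents into discrete token ids. The codebook size (512), latent dimension (256), and the name `MotionVectorQuantizer` are illustrative choices rather than BAMM's reported configuration, and the motion encoder/decoder around the quantizer is omitted.

```python
# Minimal sketch of VQ-VAE-style motion tokenization (illustrative only; the
# codebook size, latent dim, and downsampling are assumptions, not BAMM's settings).
import torch
import torch.nn as nn


class MotionVectorQuantizer(nn.Module):
    """Nearest-neighbour lookup into a learned codebook with a straight-through gradient."""

    def __init__(self, num_codes: int = 512, code_dim: int = 256):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, code_dim)
        nn.init.uniform_(self.codebook.weight, -1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z_e: torch.Tensor):
        # z_e: (batch, time, dim) continuous latents from a motion encoder.
        flat = z_e.reshape(-1, z_e.size(-1))                         # (B*T, D)
        # Squared Euclidean distance between each latent and every codebook vector.
        dists = (
            flat.pow(2).sum(1, keepdim=True)
            - 2 * flat @ self.codebook.weight.t()
            + self.codebook.weight.pow(2).sum(1)
        )                                                            # (B*T, K)
        token_ids = dists.argmin(dim=-1).view(z_e.shape[:-1])        # (B, T) discrete tokens
        z_q = self.codebook(token_ids)                               # quantized latents
        # Straight-through estimator: copy gradients from z_q back to the encoder output.
        z_q = z_e + (z_q - z_e).detach()
        return token_ids, z_q


quantizer = MotionVectorQuantizer()
z_e = torch.randn(2, 49, 256)              # e.g. temporally downsampled motion latents
tokens, z_q = quantizer(z_e)
print(tokens.shape, z_q.shape)             # torch.Size([2, 49]) torch.Size([2, 49, 256])
```

A decoder (not shown) would reconstruct the motion from `z_q`, so that each generated token id maps back to a short span of 3D motion.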
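For the second component, the following sketch only illustrates the two mask types the hybrid strategy draws on, a causal (unidirectional) mask and a full (bidirectional) mask, applied through PyTorch's `scaled_dot_product_attention`. How BAMM schedules or mixes the two masks during training is a detail of the paper and is not reproduced here.

```python
# Illustrative construction of the two attention masks behind the hybrid strategy.
import torch
import torch.nn.functional as F


def causal_mask(seq_len: int) -> torch.Tensor:
    """Unidirectional: token i attends only to tokens <= i (autoregressive decoding)."""
    return torch.tril(torch.ones(seq_len, seq_len, dtype=torch.bool))


def full_mask(seq_len: int) -> torch.Tensor:
    """Bidirectional: every token attends to the whole sequence (masked modeling)."""
    return torch.ones(seq_len, seq_len, dtype=torch.bool)


# Toy single-head attention over a short motion-token sequence.
seq_len, dim = 8, 32
q = k = v = torch.randn(1, seq_len, dim)

out_causal = F.scaled_dot_product_attention(q, k, v, attn_mask=causal_mask(seq_len))
out_bidir = F.scaled_dot_product_attention(q, k, v, attn_mask=full_mask(seq_len))
print(out_causal.shape, out_bidir.shape)   # both torch.Size([1, 8, 32])
```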
By unifying generative masked modeling and autoregressive modeling, BAMM simultaneously achieves high-quality motion generation, enhanced usability, and built-in motion editability. During inference, BAMM employs a cascaded motion decoding approach: it first generates a coarse-grained motion sequence using unidirectional autoregressive decoding, then refines it through bidirectional autoregressive decoding, as sketched below.
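The schematic below conveys only the coarse-then-refine structure of this cascaded decoding, under loud assumptions: the `model` interface, the END and MASK token ids, the confidence-based re-masking heuristic, and `refine_ratio` are hypothetical stand-ins, not BAMM's exact procedure.

```python
# Schematic of cascaded decoding: a causal pass drafts the token sequence (and its
# length), then low-confidence tokens are re-masked and re-predicted bidirectionally.
import torch


def cascaded_decode(model, text_emb, max_len=196, end_id=1024, mask_id=1025, refine_ratio=0.4):
    # Stage 1: coarse pass -- unidirectional (causal) decoding, token by token,
    # which also determines the sequence length by stopping at the END token.
    tokens = []
    for _ in range(max_len):
        logits = model(text_emb, torch.tensor(tokens + [mask_id]), causal=True)
        next_id = int(logits[-1].argmax())
        if next_id == end_id:
            break
        tokens.append(next_id)
    draft = torch.tensor(tokens, dtype=torch.long)

    # Stage 2: refinement pass -- re-mask the least confident draft tokens and
    # re-predict them with full bidirectional context.
    logits = model(text_emb, draft, causal=False)
    conf = logits.softmax(-1).gather(-1, draft.unsqueeze(-1)).squeeze(-1)
    n_refine = int(refine_ratio * len(draft))
    low_conf = conf.topk(n_refine, largest=False).indices
    refined = draft.clone()
    refined[low_conf] = mask_id
    logits = model(text_emb, refined, causal=False)
    refined[low_conf] = logits[low_conf].argmax(-1)
    return refined


class DummyModel:
    """Stand-in with the assumed interface, used only so the sketch runs."""

    def __init__(self, vocab_size: int = 1026):
        self.vocab_size = vocab_size

    def __call__(self, text_emb, tokens, causal):
        return torch.randn(len(tokens), self.vocab_size)


refined = cascaded_decode(DummyModel(), text_emb=torch.zeros(512), max_len=32)
print(refined.shape)
```

`DummyModel` exists only so the sketch executes; in practice the conditional masked transformer described above, conditioned on the text embedding, plays that role.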
Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that BAMM outperforms current state-of-the-art text-to-motion generation methods in both qualitative and quantitative measures. BAMM also supports various motion editing tasks, such as inpainting, outpainting, prefix prediction, and suffix completion, without requiring specialized training for these tasks.
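Because generation is framed as predicting masked tokens, such editing plausibly reduces to masking the span to be changed and re-predicting it with bidirectional context while the surrounding tokens stay fixed. The helper below is a hypothetical illustration of that idea; the `model` interface and `mask_id` follow the same assumptions as the decoding sketch above.

```python
# Hypothetical illustration of mask-based motion editing (e.g. inpainting): mask the
# edit region and re-predict it bidirectionally, keeping the surrounding tokens fixed.
import torch


def edit_motion_tokens(model, text_emb, tokens, edit_start, edit_end, mask_id=1025):
    """Regenerate tokens[edit_start:edit_end] while leaving the rest unchanged."""
    edited = tokens.clone()
    edited[edit_start:edit_end] = mask_id              # hide the span to be re-synthesized
    logits = model(text_emb, edited, causal=False)     # single bidirectional pass
    edited[edit_start:edit_end] = logits[edit_start:edit_end].argmax(-1)
    return edited


# Inpainting masks an interior span; outpainting, prefix prediction, and suffix
# completion simply place the masked span at the edges of the sequence instead.
```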