The paper introduces Bidirectional Autoregressive Diffusion (BAD), a novel generative framework for text-to-motion synthesis that combines the advantages of autoregressive and mask-based generative models.
The key components of the framework are:
Motion Tokenizer: A simple VQ-VAE is used to transform raw 3D motion sequences into discrete motion tokens.
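To make the tokenization step concrete, here is a minimal sketch of the core VQ-VAE operation: each continuous latent produced by the encoder is snapped to its nearest codebook entry, and the entry's index becomes the discrete motion token. The function name `quantize` and the toy shapes are illustrative, not taken from the paper.

```python
import numpy as np

def quantize(latents, codebook):
    """Map each continuous latent vector to its nearest codebook entry.

    latents:  (T, D) per-frame encoder outputs
    codebook: (K, D) learned embedding table
    Returns discrete token indices of shape (T,).
    """
    # Squared Euclidean distance from every latent to every code.
    d = ((latents[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)

# Toy example: 4 frames, 3 codes, 2-dim latents.
codebook = np.array([[0.0, 0.0], [1.0, 1.0], [-1.0, 0.5]])
latents = np.array([[0.1, -0.1], [0.9, 1.2], [-1.1, 0.4], [0.0, 0.1]])
tokens = quantize(latents, codebook)  # → array([0, 1, 2, 0])
```

At training time the codebook is learned jointly with the encoder and decoder; this sketch only shows the lookup that turns a motion sequence into tokens.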
Conditional Mask-Based Transformer: The transformer is trained to reconstruct the original motion tokens from a corrupted sequence, conditioned on a text prompt. The corruption process utilizes a novel permutation-based technique that preserves the natural sequence structure while enforcing causal dependencies.
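The corruption step can be sketched as follows: a random permutation over positions selects which tokens to hide behind a mask token, while every token keeps its original temporal slot, so the sequence structure is preserved and the permutation can later serve as a causal ordering. The helper name `corrupt` and the `MASK_ID` sentinel are illustrative assumptions, not the paper's exact implementation.

```python
import numpy as np

MASK_ID = -1  # hypothetical sentinel standing in for the [MASK] token

def corrupt(tokens, mask_ratio, rng):
    """Mask a random subset of positions chosen via a permutation.

    Returns (corrupted sequence, boolean array of hidden positions,
    the permutation defining a causal order over positions).
    """
    T = len(tokens)
    order = rng.permutation(T)            # random order over positions
    n_mask = int(round(mask_ratio * T))
    hidden = np.zeros(T, dtype=bool)
    hidden[order[:n_mask]] = True         # first n_mask in the order are hidden
    corrupted = np.where(hidden, MASK_ID, tokens)
    return corrupted, hidden, order

rng = np.random.default_rng(0)
tokens = np.arange(8)
corrupted, hidden, order = corrupt(tokens, 0.5, rng)
```

The transformer is then trained to recover the original tokens at the hidden positions, conditioned on the text prompt.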
The hybrid attention mask constructed during training allows the model to effectively capture both sequential and bidirectional relationships in the motion data. This is in contrast to autoregressive models, which struggle to model complex bidirectional patterns, and mask-based models, which often assume token independence during prediction.
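One way such a hybrid mask can be realized is sketched below, under the assumption that visible tokens are attended bidirectionally by every position, while masked positions are only visible to queries that come later in the chosen causal order. The function `hybrid_attention_mask` is a simplified illustration of the idea, not the paper's exact construction.

```python
import numpy as np

def hybrid_attention_mask(hidden, order):
    """Build a T x T boolean attention mask (True = may attend).

    Visible tokens are attended from every position (bidirectional),
    while hidden ([MASK]) positions are only attended by queries that
    occur later in the causal order defined by `order`.
    """
    T = len(hidden)
    rank = np.empty(T, dtype=int)
    rank[order] = np.arange(T)            # rank[i] = step at which i is filled
    allow = np.ones((T, T), dtype=bool)
    for q in range(T):
        for k in range(T):
            if hidden[k] and rank[k] >= rank[q]:
                allow[q, k] = False       # cannot see a not-yet-filled mask
    np.fill_diagonal(allow, True)         # each position may attend to itself
    return allow

# Tiny example: positions 1 and 3 are masked; order fills 1, then 3.
hidden = np.array([False, True, False, True])
order = np.array([1, 3, 0, 2])
allow = hybrid_attention_mask(hidden, order)
```

In this toy case, every query may attend to the visible positions 0 and 2, while position 1 (filled first) cannot see the still-masked position 3, but position 3 can see position 1.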
Extensive experiments on the HumanML3D and KIT-ML datasets demonstrate that BAD outperforms state-of-the-art autoregressive and mask-based motion models in terms of motion quality, text-motion alignment, and diversity. Additionally, BAD achieves comparable or superior results to advanced methods that utilize more complex motion tokenizers, highlighting its efficiency and effectiveness.
The paper also showcases the versatility of BAD by applying it to various text-guided motion editing tasks, such as inpainting, outpainting, prefix prediction, and suffix completion, further validating the model's ability to generate coherent and natural human motions.
by S. Rohollah ... at arxiv.org, 09-18-2024
https://arxiv.org/pdf/2409.10847.pdf