This paper introduces the first text-guided approach for generating realistic and diverse 3D hand-object interaction sequences.
BAMM is a novel text-to-motion generation framework that unifies generative masked modeling with autoregressive modeling to capture rich, bidirectional dependencies among motion tokens. It learns a direct probabilistic mapping from textual inputs to motion outputs while dynamically adjusting the motion sequence length.