This work presents a novel motion-decoupled framework that directly generates audio-driven co-speech gesture videos without relying on structural human priors. Its key innovations are a nonlinear thin-plate spline (TPS) transformation that extracts latent motion features, a transformer-based diffusion model that captures the temporal correlation between gestures and speech, and a refinement network that enhances visual details.
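The paper applies its TPS step to learned latent keypoints rather than raw pixels; purely as background on what a thin-plate spline transformation does, here is a minimal NumPy sketch. The function names and the toy keypoint setup are illustrative assumptions, not the authors' code:

```python
import numpy as np

def tps_kernel(r2):
    """TPS radial basis U(r) = r^2 log(r^2), with U(0) = 0."""
    return np.where(r2 == 0.0, 0.0, r2 * np.log(r2 + 1e-12))

def fit_tps(src, dst):
    """Solve the standard TPS linear system mapping src -> dst control points."""
    n = src.shape[0]
    d2 = np.sum((src[:, None] - src[None, :]) ** 2, axis=-1)
    K = tps_kernel(d2)                         # (n, n) radial-basis kernel
    P = np.hstack([np.ones((n, 1)), src])      # (n, 3) affine terms
    L = np.zeros((n + 3, n + 3))
    L[:n, :n], L[:n, n:], L[n:, :n] = K, P, P.T
    Y = np.zeros((n + 3, 2))
    Y[:n] = dst
    return np.linalg.solve(L, Y)               # (n + 3, 2) spline weights

def warp(W, src, pts):
    """Apply the fitted nonlinear warp to arbitrary 2-D points."""
    n = src.shape[0]
    U = tps_kernel(np.sum((pts[:, None] - src[None, :]) ** 2, axis=-1))
    P = np.hstack([np.ones((pts.shape[0], 1)), pts])
    return U @ W[:n] + P @ W[n:]

# Fit a warp from 5 reference keypoints to 5 driving keypoints,
# then deform a dense grid of image coordinates with it.
src = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, 0.5]])
dst = src + 0.05 * np.random.default_rng(0).standard_normal(src.shape)
W = fit_tps(src, dst)
grid = np.stack(np.meshgrid(np.linspace(0, 1, 4), np.linspace(0, 1, 4)), -1).reshape(-1, 2)
warped = warp(W, src, grid)                    # (16, 2) warped coordinates
```

The affine part of the solution captures global motion while the radial-basis weights model local nonlinear deformation, which is what makes TPS a natural choice for warping-based motion representations.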
This work proposes a novel self-supervised approach that generates realistic co-speech gesture videos by learning deviations in the latent representation.
TANGO is a novel framework that generates realistic, speech-synchronized co-speech gesture videos by combining a hierarchical audio-motion embedding space for accurate gesture retrieval with a diffusion-based interpolation network for seamless transitions between retrieved video segments.
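TANGO's hierarchical embedding and interpolation network are not reproduced here; as a minimal sketch of the retrieval idea only, the snippet below ranks candidate gesture clips by cosine similarity between an audio query embedding and precomputed clip embeddings, assuming both were produced by some upstream encoder (all names hypothetical):

```python
import numpy as np

def retrieve_top_k(audio_emb, clip_embs, k=3):
    """Return indices of the k gesture clips whose embeddings are most
    cosine-similar to the audio query embedding."""
    q = audio_emb / (np.linalg.norm(audio_emb) + 1e-8)
    c = clip_embs / (np.linalg.norm(clip_embs, axis=1, keepdims=True) + 1e-8)
    scores = c @ q                  # cosine similarity per candidate clip
    return np.argsort(-scores)[:k]  # highest-scoring clips first

# Example: 100 candidate clips with 128-d embeddings, one audio query.
rng = np.random.default_rng(0)
clips = rng.standard_normal((100, 128))
query = rng.standard_normal(128)
print(retrieve_top_k(query, clips))
```

In the full system, the retrieved segments would then be stitched together by the diffusion-based interpolation network to smooth the transition frames between them.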