This work presents a motion-decoupled framework that directly generates audio-driven co-speech gesture videos without relying on structural human priors. Its key components are a nonlinear thin-plate spline (TPS) transformation that extracts latent motion features, a transformer-based diffusion model that captures the temporal correlation between speech and gestures, and a refinement network that enhances visual details.
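A minimal sketch of the second component may help make the description concrete: a transformer denoiser that, at each diffusion step, predicts the noise added to a sequence of latent motion features while attending to time-aligned speech features. This is not the authors' implementation; every module name, dimension, and the additive audio/timestep conditioning scheme below are assumptions for illustration only.

```python
# Hypothetical sketch of a transformer-based diffusion denoiser for latent
# motion features conditioned on speech; shapes and names are assumptions.
import torch
import torch.nn as nn


class MotionDenoiser(nn.Module):
    def __init__(self, motion_dim=128, audio_dim=80, d_model=256, n_layers=4):
        super().__init__()
        self.motion_in = nn.Linear(motion_dim, d_model)   # project noisy motion latents
        self.audio_in = nn.Linear(audio_dim, d_model)     # project speech features
        self.time_emb = nn.Sequential(nn.Linear(1, d_model), nn.SiLU(),
                                      nn.Linear(d_model, d_model))
        layer = nn.TransformerEncoderLayer(d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=n_layers)
        self.noise_out = nn.Linear(d_model, motion_dim)   # predicted noise

    def forward(self, noisy_motion, audio, t):
        # noisy_motion: (B, T, motion_dim), audio: (B, T, audio_dim), t: (B,)
        h = self.motion_in(noisy_motion) + self.audio_in(audio)
        h = h + self.time_emb(t.float().view(-1, 1, 1) / 1000.0)
        return self.noise_out(self.encoder(h))


if __name__ == "__main__":
    model = MotionDenoiser()
    x = torch.randn(2, 32, 128)        # noisy latent motion sequence
    a = torch.randn(2, 32, 80)         # e.g. mel-spectrogram features
    t = torch.randint(0, 1000, (2,))   # diffusion timestep per sample
    print(model(x, a, t).shape)        # torch.Size([2, 32, 128])
```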
MoDiTalker, a motion-disentangled diffusion model, generates high-fidelity talking-head videos by explicitly separating generation into an audio-to-motion stage and a motion-to-video stage.
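The two-stage decoupling could be sketched as below, with toy stand-ins for both stages. The module names, feature shapes, and the choice of per-frame landmark coordinates as the intermediate motion representation are assumptions for illustration, not MoDiTalker's actual design.

```python
# Hypothetical two-stage pipeline: audio -> motion -> video (toy modules).
import torch
import torch.nn as nn


class AudioToMotion(nn.Module):
    """Stage 1: map an audio feature sequence to a per-frame motion code
    (here, flattened 2D landmarks) via a toy GRU."""
    def __init__(self, audio_dim=80, n_landmarks=68):
        super().__init__()
        self.rnn = nn.GRU(audio_dim, 256, batch_first=True)
        self.head = nn.Linear(256, n_landmarks * 2)

    def forward(self, audio):                  # (B, T, audio_dim)
        h, _ = self.rnn(audio)
        return self.head(h)                    # (B, T, 2 * n_landmarks)


class MotionToVideo(nn.Module):
    """Stage 2: render frames from an identity image conditioned on the
    per-frame motion code (toy MLP decoder, one frame at a time)."""
    def __init__(self, n_landmarks=68, img_size=64):
        super().__init__()
        self.img_size = img_size
        self.decode = nn.Sequential(
            nn.Linear(n_landmarks * 2 + 3 * img_size * img_size, 1024),
            nn.ReLU(),
            nn.Linear(1024, 3 * img_size * img_size),
        )

    def forward(self, identity, motion):       # (B, 3, H, W), (B, T, 2K)
        B, T, _ = motion.shape
        flat_id = identity.flatten(1)          # (B, 3*H*W)
        frames = []
        for t in range(T):
            x = torch.cat([flat_id, motion[:, t]], dim=-1)
            frames.append(self.decode(x).view(B, 3, self.img_size, self.img_size))
        return torch.stack(frames, dim=1)      # (B, T, 3, H, W)


if __name__ == "__main__":
    audio = torch.randn(1, 16, 80)             # 16 frames of audio features
    identity = torch.rand(1, 3, 64, 64)        # single identity image
    motion = AudioToMotion()(audio)            # stage 1: audio -> motion
    video = MotionToVideo()(identity, motion)  # stage 2: motion -> video
    print(video.shape)                         # torch.Size([1, 16, 3, 64, 64])
```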