DAWN is a novel non-autoregressive, diffusion-based framework that generates high-quality, dynamic-length talking head videos from a single portrait and an audio clip, addressing the speed, error-accumulation, and limited-context drawbacks of earlier autoregressive methods.
MoDiTalker, a novel motion-disentangled diffusion model, generates high-fidelity talking head videos by explicitly separating the generation process into audio-to-motion and motion-to-video stages.
StyleTalker is a novel audio-driven talking head generation model that synthesizes realistic videos of a talking person from a single reference image, with accurate lip sync, head poses, and eye blinks.
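MoDiTalker's separation of generation into audio-to-motion and motion-to-video stages can be illustrated with a minimal pipeline sketch. This is a hypothetical interface, not the paper's implementation: the function names, the landmark-based motion representation, and the stub logic are all illustrative assumptions; the real stages would be diffusion models.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class MotionFrame:
    # Illustrative motion representation: per-frame landmark values.
    # (The actual intermediate representation is an assumption here.)
    landmarks: List[float]

def audio_to_motion(audio: List[float], samples_per_frame: int = 25) -> List[MotionFrame]:
    """Stage 1 stub: map an audio sample sequence to a motion sequence."""
    n_frames = max(1, len(audio) // samples_per_frame)
    return [
        MotionFrame(landmarks=[audio[min(i * samples_per_frame, len(audio) - 1)]])
        for i in range(n_frames)
    ]

def motion_to_video(reference_image: List[List[float]],
                    motion: List[MotionFrame]) -> List[List[List[float]]]:
    """Stage 2 stub: render one video frame per motion frame,
    conditioned on the reference portrait."""
    return [reference_image for _ in motion]

# Usage: the two stages compose into the full talking-head pipeline.
audio = [0.0] * 100           # 100 placeholder audio samples
portrait = [[0.5] * 4] * 4    # tiny 4x4 grayscale "portrait"
motion_seq = audio_to_motion(audio)
video = motion_to_video(portrait, motion_seq)
print(len(video))  # one rendered frame per motion frame
```

The key design point the sketch captures is that the intermediate motion sequence fully decouples the two stages: lip sync quality is decided in stage 1, visual fidelity in stage 2.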