Core Concepts
A hierarchical diffusion framework, DreamHead, is proposed to effectively learn the spatial-temporal correspondence between audio input and facial dynamics for high-quality talking head video synthesis.
Summary
The proposed DreamHead framework consists of a hierarchy of two diffusion processes:
- Audio-to-Landmark Diffusion (A2L):
  - Takes the audio sequence as input and predicts a temporally smooth, accurate facial landmark sequence.
  - Utilizes multiple temporal blocks to efficiently align the audio cues with the landmark dynamics, eliminating jittering artifacts.
- Landmark-to-Image Diffusion (L2I):
  - Generates the final portrait video by learning the spatial correspondence between the predicted landmarks and the facial expressions.
  - Leverages self-attention aggregation in the diffusion process to model the spatial relationships between landmarks and appearance.
The hierarchical design allows DreamHead to effectively construct the spatial-temporal correspondence between the audio input and the output talking head video through the intermediate facial landmarks. No ground-truth landmarks are required during inference.
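The two-stage inference described above can be illustrated with a minimal sketch. This is not the paper's implementation: the denoisers below are toy stand-ins (the real stages are learned diffusion networks with temporal blocks and self-attention), and all function names are hypothetical. It only shows the hierarchical composition: A2L denoises landmarks conditioned on audio, then L2I denoises a frame conditioned on those predicted landmarks, with no ground-truth landmarks needed.

```python
# Illustrative sketch of DreamHead-style hierarchical inference.
# The denoisers are toy stand-ins for the learned diffusion networks;
# all names (a2l_denoise_step, l2i_denoise_step, ...) are hypothetical.
import random


def a2l_denoise_step(noisy_landmarks, audio_feats, t):
    """One reverse step of the Audio-to-Landmark (A2L) stage.

    Modeled here as a pull toward an audio-conditioned target; in the
    paper this is a learned temporal-block denoiser.
    """
    target = [a * 0.1 for a in audio_feats]  # stand-in for the learned predictor
    return [x + (g - x) / (t + 1) for x, g in zip(noisy_landmarks, target)]


def l2i_denoise_step(noisy_frame, landmarks, t):
    """One reverse step of the Landmark-to-Image (L2I) stage,
    conditioned on the landmarks predicted by A2L."""
    target = [lm * 2.0 for lm in landmarks]  # stand-in for the image denoiser
    return [x + (g - x) / (t + 1) for x, g in zip(noisy_frame, target)]


def dreamhead_inference(audio_feats, steps=50, seed=0):
    """Hierarchical inference: audio -> landmarks -> frame."""
    rng = random.Random(seed)

    # Stage 1 (A2L): denoise random landmarks conditioned on audio.
    # No ground-truth landmarks are used at inference time.
    landmarks = [rng.gauss(0, 1) for _ in audio_feats]
    for t in reversed(range(steps)):
        landmarks = a2l_denoise_step(landmarks, audio_feats, t)

    # Stage 2 (L2I): denoise a random frame conditioned on the
    # intermediate landmarks from Stage 1.
    frame = [rng.gauss(0, 1) for _ in landmarks]
    for t in reversed(range(steps)):
        frame = l2i_denoise_step(frame, landmarks, t)

    return landmarks, frame
```

The design point the sketch captures is that the landmark sequence acts as an explicit intermediate representation: the audio only influences the final frames through the landmarks, which is what gives the framework its spatial-temporal grounding.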
Extensive experiments on two benchmark datasets demonstrate that DreamHead can generate high-fidelity, temporally consistent, and lip-synced talking head videos, outperforming state-of-the-art methods.
Highlights
The proposed DreamHead framework can effectively eliminate jittering artifacts in the generated landmark sequences.
DreamHead produces spatially and temporally consistent portrait videos with accurate lip synchronization to the input audio.
Quotes
"Integrating landmarks can ensure temporal consistency between the audio and the final video, and provides explicit spatial constraints for accurate lip movements in synthesized talking head videos."
"The hierarchical design allows DreamHead to effectively construct the spatial-temporal correspondence between the audio input and the output talking head video through the intermediate facial landmarks."