Core Concepts
DAWN is a novel non-autoregressive diffusion-based framework for generating high-quality, dynamic-length talking head videos from a single portrait and audio, addressing limitations of previous autoregressive methods in terms of speed, error accumulation, and context utilization.
This research paper introduces DAWN, a framework for generating talking head videos from a single portrait image and an audio clip. The authors argue that existing diffusion-based methods, while effective, rely heavily on autoregressive (AR) or semi-autoregressive (SAR) strategies, which lead to slow generation, error accumulation, and limited use of contextual information.
Research Objective:
The paper aims to address the limitations of AR and SAR methods by proposing a non-autoregressive (NAR) diffusion-based framework called DAWN (Dynamic frame Avatar With Non-autoregressive diffusion) for generating high-quality talking head videos of dynamic length.
Methodology:
DAWN consists of three main components:
Latent Flow Generator (LFG): Trained in a self-supervised manner to extract identity-agnostic motion representations between video frames.
Audio-to-Video Flow Diffusion Model (A2V-FDM): Generates temporally coherent motion representations from audio, conditioned on the source image, audio embedding, and pose/blink signals.
Pose and Blink generation Network (PBNet): Generates natural head pose and blink sequences from audio in a NAR manner, providing explicit control signals to A2V-FDM.
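A minimal sketch of how these three components might be composed at inference time, assuming PyTorch-style modules. All class internals, tensor shapes (other than the 32 × 32 × 3 motion representation quoted in the Statistics section), and the placeholder renderer are illustrative assumptions, not the authors' implementation; only the data flow between PBNet, A2V-FDM, and the LFG-trained decoder follows the description above.

```python
# Illustrative composition of DAWN's components at inference time.
# Module internals are placeholders (simple linear layers); only the
# data flow between components follows the paper's description.
import torch
import torch.nn as nn

T, D_AUDIO = 40, 128   # sequence length and audio-embedding size (assumed)
H = W = 32             # motion-representation resolution (32 x 32 x 3 per the paper)

class PBNet(nn.Module):
    """Maps an audio sequence to head-pose and blink signals, non-autoregressively."""
    def __init__(self):
        super().__init__()
        self.head = nn.Linear(D_AUDIO, 6 + 1)   # 6 pose dims + 1 blink dim (assumed layout)
    def forward(self, audio):                    # audio: (B, T, D_AUDIO)
        out = self.head(audio)
        return out[..., :6], out[..., 6:]        # pose (B, T, 6), blink (B, T, 1)

class A2VFDM(nn.Module):
    """Produces a latent-flow sequence conditioned on source image, audio, pose, and blink."""
    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(D_AUDIO + 6 + 1, H * W * 3)
    def forward(self, src_img, audio, pose, blink):
        cond = torch.cat([audio, pose, blink], dim=-1)   # (B, T, D_AUDIO + 7)
        return self.proj(cond).view(-1, T, 3, H, W)      # (B, T, 3, H, W) motion representation

def render(src_img, flow):
    """Placeholder for the LFG-trained decoder that warps the portrait with each flow field."""
    B, T_ = flow.shape[:2]
    return src_img.unsqueeze(1).expand(B, T_, *src_img.shape[1:])  # (B, T, 3, 256, 256)

pbnet, a2v_fdm = PBNet(), A2VFDM()
audio = torch.randn(1, T, D_AUDIO)
src_img = torch.randn(1, 3, 256, 256)
pose, blink = pbnet(audio)                    # explicit control signals
flow = a2v_fdm(src_img, audio, pose, blink)   # temporally coherent motion representation
video = render(src_img, flow)                 # final talking-head frames
print(video.shape)                            # torch.Size([1, 40, 3, 256, 256])
```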
To enhance training and extrapolation capabilities, the authors propose a Two-stage Curriculum Learning (TCL) strategy for A2V-FDM, focusing first on lip motion generation and then on pose/blink control with longer sequences.
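The sketch below illustrates what such a two-stage curriculum schedule could look like. The clip lengths (20 frames in stage one, 30 to 40 frames in stage two) come from the Statistics section; the model, loss, and data sampler are placeholders standing in for A2V-FDM and its denoising objective.

```python
# Sketch of a Two-stage Curriculum Learning (TCL) schedule: stage 1 trains on
# short 20-frame clips (lip motion focus), stage 2 on 30-40 frame clips with
# pose/blink conditioning enabled. Only the curriculum structure is real here.
import random
import torch
import torch.nn as nn

model = nn.Linear(16, 16)                      # stand-in for A2V-FDM
optim = torch.optim.Adam(model.parameters(), lr=1e-4)

def sample_clip(num_frames):
    """Placeholder sampler: returns a random (num_frames, 16) feature clip."""
    return torch.randn(num_frames, 16)

def diffusion_loss(model, clip, pose_blink):
    """Placeholder denoising objective; pose_blink is None in stage 1."""
    inp = clip if pose_blink is None else clip + pose_blink
    pred = model(inp)
    return (pred - clip).pow(2).mean()

stages = [
    # (steps, clip-length sampler, pose/blink conditioning enabled)
    (1000, lambda: 20, False),                     # stage 1: fixed 20-frame clips
    (1000, lambda: random.randint(30, 40), True),  # stage 2: 30-40 frame clips
]

for steps, clip_len, use_pose_blink in stages:
    for _ in range(steps):
        clip = sample_clip(clip_len())
        pose_blink = torch.randn_like(clip) if use_pose_blink else None
        loss = diffusion_loss(model, clip, pose_blink)
        optim.zero_grad()
        loss.backward()
        optim.step()
```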
Key Findings:
DAWN outperforms state-of-the-art methods in terms of visual quality (FID, FVD), lip-sync accuracy (LSE-C, LSE-D), identity preservation (CSIM), and naturalness of head motion and blinks (BAS, Blink/s) on both the CREMA and HDTF datasets.
The NAR approach enables faster generation speeds compared to AR and SAR methods.
The TCL strategy significantly improves the model's convergence and extrapolation ability.
PBNet effectively disentangles pose/blink generation, simplifying A2V-FDM training and enhancing long-term dependency modeling.
Main Conclusions:
DAWN presents a significant advancement in talking head video generation by enabling high-quality, dynamic-length video synthesis with faster generation speeds and improved temporal consistency. The proposed NAR approach, coupled with the TCL strategy and PBNet, effectively addresses limitations of previous methods and paves the way for more efficient and realistic talking head generation.
Significance:
This research contributes significantly to the field of computer vision, specifically in talking head video generation. The proposed NAR approach and TCL strategy offer valuable insights for improving diffusion-based video generation models.
Limitations and Future Research:
While DAWN demonstrates promising results, future research could explore:
Enhancing the model's ability to generate more diverse and expressive facial expressions.
Investigating the potential of NAR diffusion models for other video generation tasks.
Statistics
The CREMA dataset contains 7,442 videos from 91 identities, with durations ranging from 1 to 5 seconds.
The HDTF dataset consists of 410 videos, with an average duration exceeding 100 seconds.
The PBNet model is trained using pose and blink movement sequences of 200 frames.
During the inference phase of the PBNet model, a local attention mechanism with a window size of 400 is employed.
The motion representation space of the A2V-FDM model is 32 × 32 × 3.
In the first stage of TCL, the A2V-FDM model is trained using video clips of 20 frames.
In the second stage of TCL, training is performed on sequences ranging from 30 to 40 frames.
For the inference phase of the A2V-FDM model, local attention with a window size of 80 is applied.
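The following is a minimal sketch of the kind of local (banded) attention mask these window sizes imply at inference. It assumes the window is centered on each query frame, which is an assumption on our part; the exact placement used by PBNet and A2V-FDM may differ.

```python
# Local attention mask: each frame may only attend to frames within a window.
# Window placement (centered on the query) is assumed, not stated in the paper.
import torch

def local_attention_mask(seq_len, window):
    """True where attention is allowed: |i - j| <= window // 2."""
    idx = torch.arange(seq_len)
    return (idx[None, :] - idx[:, None]).abs() <= window // 2

mask = local_attention_mask(seq_len=200, window=80)   # A2V-FDM-style window (PBNet would use 400)
scores = torch.randn(200, 200)                         # placeholder attention logits
scores = scores.masked_fill(~mask, float("-inf"))
attn = scores.softmax(dim=-1)                          # each frame attends only to its local window
```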