Core Concepts
ERLNet, a novel generative framework, produces realistic talking head videos with precise control over facial expressions and head movements by using FLAME coefficients as an intermediate representation.
Abstract
The paper proposes the Embedded Representation Learning Network (ERLNet), a novel approach, based on Neural Radiance Fields (NeRF), for generating style-controllable talking head videos.
The key components of ERLNet are:
- Audio Driven FLAME (ADF) Module:
  - Learns latent representations of facial expressions and head poses using two independent VQ-VAE codebooks (see the quantization sketch after this list).
  - Extracts style features from the input style video and combines them with audio features to generate a FLAME coefficient sequence synchronized with the speech.
- Dual-Branch Fusion NeRF (DBF-NeRF):
  - Employs two separate NeRFs (Head-NeRF and Static-NeRF) to model the head and static regions, respectively.
  - Fuses the feature maps and density maps from the two NeRFs with density-based weighting to render the final high-resolution image (see the fusion sketch after this list).
  - Incorporates a deformation module to handle the non-rigid motion of the torso region.
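A minimal sketch of the vector-quantization step behind the ADF module's two codebooks, as referenced above. The summary only states that VQ-VAE codebooks are used; the class name, codebook sizes, and embedding dimensions below are illustrative assumptions, not ERLNet's actual implementation.

```python
import torch
import torch.nn as nn

class VQCodebook(nn.Module):
    """Nearest-neighbor vector quantization with a straight-through gradient.

    ADF reportedly learns two such codebooks, one for facial expressions
    and one for head poses; all sizes here are placeholder assumptions.
    """

    def __init__(self, num_codes: int = 512, dim: int = 64):
        super().__init__()
        self.codebook = nn.Embedding(num_codes, dim)
        self.codebook.weight.data.uniform_(-1.0 / num_codes, 1.0 / num_codes)

    def forward(self, z: torch.Tensor) -> torch.Tensor:
        # z: (batch, time, dim) continuous features from an upstream encoder.
        flat = z.reshape(-1, z.shape[-1])
        # Squared Euclidean distance from each feature to every codebook entry.
        dists = (flat.pow(2).sum(1, keepdim=True)
                 - 2 * flat @ self.codebook.weight.t()
                 + self.codebook.weight.pow(2).sum(1))
        idx = dists.argmin(dim=1)                # index of the nearest code
        z_q = self.codebook(idx).view_as(z)      # quantized features
        # Straight-through estimator: gradients flow to z as if unquantized.
        return z + (z_q - z).detach()

# Two independent codebooks, mirroring the expression/pose split in ADF.
expr_codebook = VQCodebook(num_codes=512, dim=64)
pose_codebook = VQCodebook(num_codes=128, dim=16)
```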
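And a sketch of one plausible reading of the density-based fusion in DBF-NeRF, also referenced above. The summary says only that the feature and density maps of the two branches are fused by density; the specific weighting rule here is an assumption.

```python
import torch

def density_fuse(feat_head: torch.Tensor, sigma_head: torch.Tensor,
                 feat_static: torch.Tensor, sigma_static: torch.Tensor,
                 eps: float = 1e-6):
    """Blend Head-NeRF and Static-NeRF outputs by their predicted densities.

    feat_*  : (..., C) per-sample feature maps from each branch.
    sigma_* : (..., 1) non-negative density maps from the same branches.
    """
    w_head = sigma_head / (sigma_head + sigma_static + eps)  # weight in [0, 1]
    feat = w_head * feat_head + (1.0 - w_head) * feat_static
    sigma = sigma_head + sigma_static                        # combined opacity
    return feat, sigma
```

Under this reading, samples where Head-NeRF predicts higher density are dominated by the head branch, while the static branch fills in the background; the fused feature map would then be decoded into the final high-resolution frame.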
The authors also introduce a new dataset called Long-Duration Styled Talking (LDST), which contains long-duration video segments with diverse facial expressions and minimal torso movements.
Extensive experiments demonstrate that ERLNet outperforms existing state-of-the-art methods in terms of image quality, lip synchronization, and style control.
Statistics
"Generating a FLAME coefficients sequence that closely resembles the ground truth as much as possible."
"Minimizing the difference between two encoded features of the cropped mouth region and audio."
Quotes
"Compared to previous state-of-the-art methods, our approach demonstrates the ability to produce higher-quality images while also learning distinct expression styles and head pose styles, thereby enhancing the realism of our generated results."