Core concept
GaussianTalker is a novel framework that binds 3D Gaussian Splatting to the FLAME model to generate lifelike, realistic talking-head videos, associating multimodal audio-visual data with specific speakers.
Summary
The paper proposes GaussianTalker, a novel framework for audio-driven talking head synthesis that leverages 3D Gaussian Splatting and the FLAME model. The key highlights are:
- GaussianTalker consists of two main modules:
  - Speaker-Specific Motion Translator: decouples identity information from the audio features and uses a personalized embedding to generate FLAME parameters that closely match the target speaker's talking style.
  - Dynamic Gaussian Renderer: binds 3D Gaussians to the FLAME mesh, driving them through the deformation of the FLAME model, and introduces Speaker-specific BlendShapes to enhance the representation of facial details.
- The framework achieves precise lip synchronization and exceptional visual quality, outperforming state-of-the-art methods in both quantitative and qualitative evaluations.
- GaussianTalker renders video at 130 FPS on an NVIDIA RTX 4090 GPU, far exceeding real-time performance thresholds, and can potentially be deployed on other hardware platforms.
- Extensive experiments and ablation studies demonstrate the effectiveness of the key components, including the Universal Audio Encoder, Speaker-specific BlendShapes, and the Gaussian Semantic Loss.
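The core mechanism of the Dynamic Gaussian Renderer can be sketched as follows. This is a minimal illustration, not the paper's implementation (which also transforms Gaussian rotations and scales); all names and the barycentric parameterization here are assumptions. Each Gaussian center is expressed relative to one FLAME triangle, so deforming the mesh moves the bound Gaussians with it:

```python
import numpy as np

def deform_bound_gaussians(verts, faces, bindings, barycentric, offsets):
    """Move 3D Gaussian centers with a deforming FLAME-style mesh.

    Each Gaussian is bound to one triangle; its center is the
    barycentric interpolation of that (possibly deformed) triangle's
    vertices plus a per-Gaussian local offset. Illustrative sketch only.
    """
    tri = verts[faces[bindings]]                         # (G, 3, 3) triangle verts per Gaussian
    centers = np.einsum('gi,gij->gj', barycentric, tri)  # barycentric interpolation
    return centers + offsets

# Toy mesh: one triangle whose vertices a FLAME-style model would deform.
verts = np.array([[0.0, 0.0, 0.0],
                  [1.0, 0.0, 0.0],
                  [0.0, 1.0, 0.0]])
faces = np.array([[0, 1, 2]])

bindings = np.array([0, 0])                 # both Gaussians bound to triangle 0
barycentric = np.array([[1/3, 1/3, 1/3],    # centroid of the triangle
                        [0.5, 0.5, 0.0]])   # midpoint of one edge
offsets = np.zeros((2, 3))                  # learned local offsets (zero here)

centers = deform_bound_gaussians(verts, faces, bindings, barycentric, offsets)

# Deform the mesh (here a rigid translation): the bound Gaussians follow.
moved = deform_bound_gaussians(verts + [0.0, 0.0, 1.0],
                               faces, bindings, barycentric, offsets)
```

Because the Gaussians are parameterized in the mesh's local geometry, audio-driven FLAME parameter changes translate directly into Gaussian motion without re-optimizing the Gaussians per frame.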
Statistics
- GaussianTalker achieves a PSNR of 37.08, SSIM of 0.9676, and LPIPS of 0.0239 on the test set.
- The Landmark Distance (LMD) is 3.278, indicating accurate lip synchronization.
- The Lip Sync Error Confidence (LSE-C) and Lip Sync Error Distance (LSE-D) scores are 7.015 and 7.562, respectively, outperforming competing methods.
- Rendering runs at 130 FPS on an NVIDIA RTX 4090 GPU.
Quotes
"GaussianTalker consists of two main modules: Speaker-Specific Motion Translator and Dynamic Gaussian Renderer."
"By binding the Gaussians to the geometric topology of the FLAME model, dynamic talking heads can be generated by synchronizing the displacement of the bound Gaussians with changes in the facial attribute parameters."
"To address the challenge of unnatural lip movements caused by inconsistent distributions, we propose a Speaker-specific Motion Translator."
"To confront the challenge posed by unrealistic visual effects caused by the inherent limitations of 3D facial models, the Dynamic Gaussian Renderer introduces Speaker-specific BlendShapes."