The paper introduces JEAN, a novel method for joint expression and audio-guided NeRF-based talking face generation. The key contributions are:
A self-supervised approach to disentangle facial expressions from lip motion. The method exploits the observation that speech-driven mouth motion and expression-driven face motion differ in their temporal and spatial characteristics. A self-supervised landmark autoencoder disentangles lip motion from the rest of the face, and a contrastive learning strategy aligns the learned audio features to the lip motion features (a minimal contrastive-loss sketch follows this list).
A transformer-based architecture that learns expression features, capturing long-range facial expressions and disentangling them from speech-specific lip motion (see the second sketch below).
A dynamic NeRF, conditioned on the learned audio and expression representations, that synthesizes high-fidelity talking face videos faithfully following the input facial expressions and speech signal for a given identity (see the third sketch below).
Quantitative and qualitative evaluations demonstrate that JEAN outperforms state-of-the-art methods in terms of lip synchronization, expression transfer, and identity preservation.
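The contrastive alignment in the first contribution can be illustrated with a symmetric InfoNCE-style objective that pulls audio and lip-motion features from the same time window together and pushes mismatched pairs apart. This is a minimal PyTorch sketch, not the paper's exact loss; the feature dimension, in-batch negative pairing, and temperature value are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(audio_feats, lip_feats, temperature=0.07):
    """Symmetric InfoNCE loss: each audio feature is pulled toward the
    lip-motion feature from the same window (the diagonal) and pushed
    away from the other windows in the batch.
    audio_feats, lip_feats: (batch, dim) outputs of the two encoders.
    """
    a = F.normalize(audio_feats, dim=-1)
    l = F.normalize(lip_feats, dim=-1)
    logits = a @ l.t() / temperature                      # (batch, batch) similarities
    targets = torch.arange(a.size(0), device=a.device)    # positives on the diagonal
    # Average the audio->lip and lip->audio directions.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# usage: a batch of 8 paired feature vectors
loss = info_nce_loss(torch.randn(8, 128), torch.randn(8, 128))
```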
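For the second contribution, a transformer encoder over a sequence of per-frame face features can capture long-range expression dynamics and pool them into a single expression code. The sketch below assumes 68 2D landmarks per frame, learned positional embeddings, mean pooling, and the given layer sizes; none of these specifics are taken from the paper.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Transformer over per-frame landmark features; the pooled output serves
    as an expression code intended to be disentangled from lip motion."""
    def __init__(self, in_dim=136, d_model=256, n_heads=4, n_layers=4, max_len=100):
        super().__init__()
        self.proj = nn.Linear(in_dim, d_model)                      # embed per-frame landmarks
        self.pos = nn.Parameter(torch.zeros(1, max_len, d_model))   # learned positions
        layer = nn.TransformerEncoderLayer(d_model, n_heads,
                                           dim_feedforward=512,
                                           batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, n_layers)

    def forward(self, landmarks):                   # (batch, time, in_dim)
        t = landmarks.size(1)
        h = self.proj(landmarks) + self.pos[:, :t]  # add positional cue
        h = self.encoder(h)                         # (batch, time, d_model)
        return h.mean(dim=1)                        # pool over time -> expression code

# usage: 68 x 2 landmarks per frame, 25-frame window
expr_code = ExpressionEncoder()(torch.randn(2, 25, 136))
```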
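The third contribution, conditioning a dynamic NeRF on both codes, can be sketched as a NeRF-style MLP whose density branch takes encoded 3D positions concatenated with the audio and expression codes, with color predicted from the view direction as usual. The positional-encoding dimensions, layer widths, and concatenation scheme here are assumptions, not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class ConditionedNeRF(nn.Module):
    """NeRF-style MLP mapping an encoded 3D point, view direction, and the two
    learned codes to density and color."""
    def __init__(self, pos_dim=63, dir_dim=27, audio_dim=128, expr_dim=256,
                 hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(
            nn.Linear(pos_dim + audio_dim + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
        )
        self.sigma = nn.Linear(hidden, 1)                     # volume density
        self.rgb = nn.Sequential(
            nn.Linear(hidden + dir_dim, hidden // 2), nn.ReLU(),
            nn.Linear(hidden // 2, 3), nn.Sigmoid(),          # color in [0, 1]
        )

    def forward(self, x, d, audio_code, expr_code):
        # x: positionally encoded 3D samples, d: encoded view directions;
        # the audio/expression codes are broadcast to every sampled point.
        h = self.trunk(torch.cat([x, audio_code, expr_code], dim=-1))
        return self.sigma(h), self.rgb(torch.cat([h, d], dim=-1))

# usage: 1024 points sampled along camera rays
n = 1024
sigma, rgb = ConditionedNeRF()(torch.randn(n, 63), torch.randn(n, 27),
                               torch.randn(n, 128), torch.randn(n, 256))
```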
Key insights distilled from the paper by Sai Tanmay R... at arxiv.org, 09-19-2024: https://arxiv.org/pdf/2409.12156.pdf