The paper introduces JEAN, a novel method for joint expression and audio-guided NeRF-based talking face generation. The key contributions are:
A self-supervised approach to disentangling facial expressions from lip motion. The method leverages the observation that speech-related mouth motion and expression-related face motion differ both temporally and spatially. A self-supervised landmark autoencoder separates lip motion from the motion of the rest of the face, and a contrastive learning strategy aligns the learned audio features with the lip motion features (a sketch of this alignment follows the summary below).
A transformer-based architecture that learns expression features, capturing long-range expression dynamics while keeping them disentangled from speech-specific lip motion (see the encoder sketch below).
A dynamic NeRF, conditioned on the learned audio and expression representations, that synthesizes high-fidelity talking face videos which faithfully follow the input facial expressions and speech signal for a given identity (see the conditional-NeRF sketch below).
Quantitative and qualitative evaluations demonstrate that JEAN outperforms state-of-the-art methods in terms of lip synchronization, expression transfer, and identity preservation.
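To make the contrastive audio-to-lip alignment concrete, here is a minimal PyTorch sketch using a symmetric InfoNCE objective: the audio feature of frame i is pulled toward the lip-motion feature of frame i and pushed away from other frames in the batch. InfoNCE is one standard choice for such alignment; the feature dimensions, batch construction, and temperature are illustrative assumptions, not the paper's actual hyperparameters.

```python
import torch
import torch.nn.functional as F

def info_nce_loss(audio_feats, lip_feats, temperature=0.07):
    """Symmetric InfoNCE: matching (audio, lip) pairs attract, others repel."""
    audio = F.normalize(audio_feats, dim=-1)   # (B, D)
    lip = F.normalize(lip_feats, dim=-1)       # (B, D)
    logits = audio @ lip.t() / temperature     # (B, B) similarity matrix
    targets = torch.arange(audio.size(0), device=audio.device)
    # Align in both directions: audio -> lip and lip -> audio.
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# Toy usage with random tensors standing in for encoder outputs.
audio_feats = torch.randn(8, 128, requires_grad=True)  # hypothetical audio encoder output
lip_feats = torch.randn(8, 128, requires_grad=True)    # hypothetical lip-motion encoder output
loss = info_nce_loss(audio_feats, lip_feats)
loss.backward()
print(loss.item())
```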
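The transformer-based expression encoder can be pictured as self-attention over a landmark sequence, pooled into a single expression feature. The sketch below is an assumption-laden illustration: the landmark dimensionality (68 2-D landmarks), model width, use of a learned summary token, and omission of positional encodings are all simplifications, not the paper's architecture.

```python
import torch
import torch.nn as nn

class ExpressionEncoder(nn.Module):
    """Pools a landmark sequence into one expression feature via self-attention."""
    def __init__(self, landmark_dim=136, d_model=256, nhead=4, num_layers=4):
        super().__init__()
        self.embed = nn.Linear(landmark_dim, d_model)
        self.cls = nn.Parameter(torch.zeros(1, 1, d_model))  # learned summary token
        layer = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers)

    def forward(self, landmarks):            # (B, T, landmark_dim)
        x = self.embed(landmarks)            # (B, T, d_model)
        cls = self.cls.expand(x.size(0), -1, -1)
        x = torch.cat([cls, x], dim=1)       # prepend summary token
        x = self.encoder(x)                  # attention spans the whole clip,
                                             # so long-range dynamics are visible
        return x[:, 0]                       # (B, d_model) expression feature
        # Positional encodings are omitted here for brevity.

enc = ExpressionEncoder()
feats = enc(torch.randn(2, 50, 136))  # 2 clips, 50 frames, 68 2-D landmarks
print(feats.shape)                    # torch.Size([2, 256])
```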
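Finally, a minimal sketch of how a NeRF can be conditioned on the two learned representations: each 3-D sample point is frequency-encoded and decoded to color and density by an MLP that also receives the per-frame audio and expression features. The layer sizes, encoding depth, and single-MLP layout are assumptions for illustration; the actual JEAN model is more involved (e.g., it models dynamics and renders via volume integration).

```python
import torch
import torch.nn as nn

def positional_encoding(x, num_freqs=6):
    """Standard NeRF-style frequency encoding of input coordinates."""
    out = [x]
    for i in range(num_freqs):
        out += [torch.sin(2.0 ** i * x), torch.cos(2.0 ** i * x)]
    return torch.cat(out, dim=-1)

class ConditionalNeRF(nn.Module):
    """MLP mapping (encoded point, audio feature, expression feature) -> (rgb, sigma)."""
    def __init__(self, audio_dim=64, expr_dim=256, hidden=256, num_freqs=6):
        super().__init__()
        pe_dim = 3 * (1 + 2 * num_freqs)
        self.num_freqs = num_freqs
        self.mlp = nn.Sequential(
            nn.Linear(pe_dim + audio_dim + expr_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 4),  # (r, g, b, sigma) per sample point
        )

    def forward(self, pts, audio_feat, expr_feat):
        # pts: (N, 3) sample points; the per-frame features are broadcast to each point.
        pe = positional_encoding(pts, self.num_freqs)
        cond = torch.cat([audio_feat, expr_feat], dim=-1).expand(pts.size(0), -1)
        out = self.mlp(torch.cat([pe, cond], dim=-1))
        rgb = torch.sigmoid(out[..., :3])
        sigma = torch.relu(out[..., 3:])
        return rgb, sigma

nerf = ConditionalNeRF()
rgb, sigma = nerf(torch.rand(1024, 3),
                  torch.randn(1, 64),    # hypothetical audio feature
                  torch.randn(1, 256))   # expression feature from the encoder
print(rgb.shape, sigma.shape)            # (1024, 3) (1024, 1)
```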