LipVoicer is a novel approach for generating high-quality and intelligible speech from silent videos. It consists of three main components:
MelGen: A conditional denoising diffusion probabilistic model (DDPM) that learns to generate a mel-spectrogram from the input video. MelGen is trained for classifier-free guidance, i.e., the video condition is randomly dropped during training so that both conditional and unconditional predictions are available at sampling time (a minimal sketch of the guidance step follows this list).
Lip-reading model: A pre-trained lip-reading network that infers the most likely text from the silent video.
ASR-based guidance: An ASR system that anchors the generated mel-spectrogram to the text predicted by the lip-reader using classifier guidance.
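To make MelGen's classifier-free guidance concrete, here is a minimal sketch of the sampling-time computation. The toy model, `video_emb`, and the guidance scale `w` are illustrative stand-ins, not names from the LipVoicer codebase; the real MelGen is a much larger network conditioned on the full silent-video encoding.

```python
import torch
import torch.nn as nn

class TinyEpsModel(nn.Module):
    """Stand-in noise predictor; the real MelGen is far larger and
    conditions on the silent-video encoding rather than one vector."""
    def __init__(self, mel_dim=80, cond_dim=256):
        super().__init__()
        self.net = nn.Linear(mel_dim + cond_dim + 1, mel_dim)
        # Learned "null" embedding used when the condition is dropped.
        self.null_cond = nn.Parameter(torch.zeros(cond_dim))

    def forward(self, x_t, t, video_emb=None):
        cond = self.null_cond.expand(x_t.shape[0], -1) if video_emb is None else video_emb
        t_feat = t.float().unsqueeze(-1) / 1000.0  # crude timestep embedding
        return self.net(torch.cat([x_t, cond, t_feat], dim=-1))

def cfg_noise_estimate(eps_model, x_t, t, video_emb, w=2.0):
    """Classifier-free guidance: extrapolate from the unconditional
    prediction toward the video-conditioned one with scale w."""
    eps_cond = eps_model(x_t, t, video_emb)  # conditioned on the video
    eps_uncond = eps_model(x_t, t, None)     # condition dropped
    return (1 + w) * eps_cond - w * eps_uncond

# Toy usage on random tensors.
model = TinyEpsModel()
x_t = torch.randn(4, 80)                     # a batch of noisy mel frames
t = torch.full((4,), 500, dtype=torch.long)  # current diffusion timestep
video_emb = torch.randn(4, 256)              # output of a video encoder
eps_hat = cfg_noise_estimate(model, x_t, t, video_emb)
print(eps_hat.shape)  # torch.Size([4, 80])
```

Dropping the video condition for a random fraction of training examples is what makes the unconditional branch available at sampling time, so a single network serves both roles.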
The key innovation of LipVoicer is the use of the text modality, inferred from the lip-reading model, to guide the speech generation process at inference time. This helps alleviate the ambiguities inherent in lip motion and ensures that the generated speech aligns with the spoken text.
LipVoicer is evaluated on the challenging LRS2 and LRS3 datasets, which contain in-the-wild videos with diverse speakers, accents, and speaking styles. The results show that LipVoicer outperforms multiple recent lip-to-speech baselines in terms of intelligibility, naturalness, quality, and synchronization, as measured by both objective metrics and human evaluation.