Key Concepts
VOICECRAFT achieves state-of-the-art performance in speech editing and zero-shot TTS using a novel token rearrangement procedure.
Statistics
We introduce VOICECRAFT, a token infilling neural codec language model that achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music.
For speech editing evaluation, we introduce a high-quality, challenging, and realistic dataset named REALEDIT.
Our contributions include VOICECRAFT, a speech editing model whose synthesized speech is nearly indistinguishable from in-the-wild recordings according to human listeners.
VOICECRAFT generalizes well to zero-shot TTS without finetuning.
We release REALEDIT, a high-quality, challenging, and realistic speech editing evaluation dataset.
Quotes
"I found this um incredible model"
"I found the amazing VoiceCraft model"