Core Concepts
VOICECRAFT achieves state-of-the-art performance on speech editing and zero-shot TTS with innovative token rearrangement.
Statistics
VOICECRAFT achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
Crucially, the models are evaluated on challenging and realistic datasets that include diverse accents, speaking styles, recording conditions, and background noise and music.
For each utterance, we determine the type of modification using predefined probability distributions of editing type, number of disjoint spans to be edited, and editing span length.
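The per-utterance sampling described here can be illustrated with a minimal Python sketch. The source gives neither the actual distributions nor the set of editing types, so every probability table and name below (`EDIT_TYPE_PROBS`, `NUM_SPANS_PROBS`, `SPAN_LENGTH_PROBS`, `sample_edit_plan`) is a hypothetical placeholder, not a value from the paper.

```python
import random

# Hypothetical distributions: the source does not state the real values.
EDIT_TYPE_PROBS = {"substitution": 0.6, "insertion": 0.2, "deletion": 0.2}
NUM_SPANS_PROBS = {1: 0.7, 2: 0.2, 3: 0.1}
SPAN_LENGTH_PROBS = {1: 0.4, 2: 0.3, 3: 0.2, 4: 0.1}  # length in words


def sample_from(dist, rng):
    """Draw one key from a {value: probability} table."""
    values = list(dist.keys())
    weights = list(dist.values())
    return rng.choices(values, weights=weights, k=1)[0]


def sample_edit_plan(rng):
    """Sample the editing type, the number of disjoint spans, and one length per span."""
    num_spans = sample_from(NUM_SPANS_PROBS, rng)
    return {
        "edit_type": sample_from(EDIT_TYPE_PROBS, rng),
        "num_spans": num_spans,
        "span_lengths": [sample_from(SPAN_LENGTH_PROBS, rng) for _ in range(num_spans)],
    }


if __name__ == "__main__":
    print(sample_edit_plan(random.Random(0)))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which is convenient when the same modification plan must be regenerated for a given utterance.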
The training loss is calculated as the negative log-likelihood $\mathcal{L}(\theta) = -\log P_\theta(Z \mid W) = -\sum_{k=1}^{K} \alpha_k \mathcal{L}_k(\theta)$.
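As a rough illustration of the weighted sum in this loss, the sketch below assumes that k indexes K separate token streams (for example, codec codebooks) and that each L_k(θ) is the log-likelihood of the target tokens in stream k. Both are assumptions, since the source line does not define these terms. The function name `weighted_nll` and its input layout are hypothetical.

```python
import math


def weighted_nll(target_probs, alphas):
    """Compute -sum_k alpha_k * L_k, with L_k the log-likelihood of stream k.

    target_probs[k] lists the probabilities the model assigned to the
    correct tokens of stream k (an assumed input layout).
    """
    if len(target_probs) != len(alphas):
        raise ValueError("need exactly one weight per stream")
    loss = 0.0
    for probs_k, alpha_k in zip(target_probs, alphas):
        log_lik_k = sum(math.log(p) for p in probs_k)  # L_k(theta)
        loss -= alpha_k * log_lik_k
    return loss


if __name__ == "__main__":
    # Two streams, one target token each, with the first stream weighted twice as much.
    print(weighted_nll([[0.5], [0.5]], [2.0, 1.0]))
```

With every weight set to 1, the function reduces to the ordinary negative log-likelihood summed over streams.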
During inference for speech editing tasks, we select masked spans by comparing the original transcript and the target transcript to identify the words that should be masked out.
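One way to picture the transcript comparison is a word-level diff. The sketch below uses `difflib.SequenceMatcher` from the Python standard library. This is a stand-in chosen for illustration, not necessarily the alignment method the authors use, and the function name `find_spans_to_mask` and the example sentences are hypothetical.

```python
from difflib import SequenceMatcher


def find_spans_to_mask(original, target):
    """Return (tag, start, end) word-index spans of the original transcript that differ.

    Indices are half-open and refer to words of the original. For an
    insertion, start == end marks the position where new words go.
    """
    orig_words = original.split()
    target_words = target.split()
    matcher = SequenceMatcher(a=orig_words, b=target_words, autojunk=False)
    spans = []
    for tag, i1, i2, _j1, _j2 in matcher.get_opcodes():
        if tag != "equal":
            spans.append((tag, i1, i2))
    return spans


if __name__ == "__main__":
    print(find_spans_to_mask("the quick brown fox", "the quick red fox"))
    # [('replace', 2, 3)]
```

A real system would also have to map these word indices to audio frames, for example through forced alignment. That step is outside this sketch.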
For zero-shot TTS evaluation, we randomly sampled 80 utterances from LibriTTS and YouTube datasets for human rating evaluations.
Quotes
"I found this um incredible model"
"We introduce VOICECRAFT"
"Our contributions are..."