Core Concepts
VOICECRAFT achieves state-of-the-art performance on speech editing and zero-shot TTS via a novel two-step token rearrangement procedure.
Abstract
Introduces VOICECRAFT, a Transformer-based neural codec language model achieving SotA in speech editing and zero-shot TTS.
Utilizes a two-step token rearrangement procedure for autoregressive generation with bidirectional context.
Evaluation on challenging datasets shows superior performance compared to prior models.
Includes a high-quality dataset named REALEDIT for realistic speech editing evaluation.
Demonstrates human preference for VOICECRAFT edited speech over original recordings in terms of naturalness.
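The two-step token rearrangement noted above can be illustrated with a toy sketch. This is a hypothetical simplification (single code stream for step 1, plain Python lists, made-up mask/EOS tokens), not the paper's implementation: step 1 (causal masking) moves masked spans to the end of the sequence so an autoregressive model conditions on both left and right context; step 2 (delayed stacking) offsets each codebook by its index.

```python
def causal_rearrange(tokens, spans):
    """Step 1 (causal masking), sketched: replace each masked span with a
    mask token in place, then append each span (preceded by its mask token)
    after an end-of-sequence marker. Token names are illustrative."""
    out, tail, prev = [], [], 0
    for i, (s, e) in enumerate(spans):
        out += tokens[prev:s] + [f"<M{i}>"]   # mask token marks the gap
        tail += [f"<M{i}>"] + tokens[s:e]     # span content moved to the end
        prev = e
    return out + tokens[prev:] + ["<EOS>"] + tail

def delay_stack(codes, pad=0):
    """Step 2 (delayed stacking), sketched: shift codebook k right by k
    timesteps so the token for codebook k at time t is emitted at step t+k."""
    K = len(codes)
    return [[pad] * k + codes[k] + [pad] * (K - 1 - k) for k in range(K)]
```

For example, masking the span at positions 2–4 of `[1, 2, 3, 4, 5, 6]` yields `[1, 2, '<M0>', 5, 6, '<EOS>', '<M0>', 3, 4]`, giving the model the full surrounding context before it generates the masked tokens.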
Stats
VOICECRAFT achieves state-of-the-art performance on both speech editing and zero-shot text-to-speech (TTS) on audiobooks, internet videos, and podcasts.
Crucially, the models are evaluated on challenging and realistic datasets that consist of diverse accents, speaking styles, recording conditions, and background noise and music.
For each utterance, we determine the type of modification using predefined probability distributions of editing type, number of disjoint spans to be edited, and editing span length.
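This sampling of a training-time edit configuration could be sketched as follows. The distributions below are hypothetical placeholders, not the paper's actual values:

```python
import random

# Hypothetical probability distributions; the paper's actual values differ.
EDIT_TYPES = {"insertion": 0.3, "deletion": 0.3, "substitution": 0.4}
SPAN_COUNTS = {1: 0.6, 2: 0.3, 3: 0.1}

def sample_edit_config(n_words, rng=random):
    """Sample an edit type, number of disjoint spans, and span length
    for one training utterance of n_words words."""
    edit_type = rng.choices(list(EDIT_TYPES), weights=list(EDIT_TYPES.values()))[0]
    n_spans = rng.choices(list(SPAN_COUNTS), weights=list(SPAN_COUNTS.values()))[0]
    span_len = rng.randint(1, max(1, n_words // 4))  # cap span at ~25% of words
    return edit_type, n_spans, span_len
```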
The training loss is the negative log-likelihood L(θ) = −log P_θ(Z|W) = −Σ_{k=1}^{K} α_k L_k(θ), where L_k is the log-likelihood of codebook k and α_k is its weight.
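A minimal numerical sketch of this weighted negative log-likelihood, assuming `logprobs[k]` holds the per-token log-probabilities the model assigns under codebook k (function name and layout are illustrative):

```python
import math

def weighted_nll(logprobs, weights):
    """L(theta) = -sum_k alpha_k * L_k(theta): negate the alpha-weighted
    sum of per-codebook log-likelihoods (sum of per-token log-probs)."""
    return -sum(a * sum(lp) for a, lp in zip(weights, logprobs))

# Two codebooks: codebook 0 has two tokens at p=0.5, codebook 1 has one at p=0.25.
loss = weighted_nll(
    [[math.log(0.5), math.log(0.5)], [math.log(0.25)]],
    [1.0, 0.5],
)
```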
During inference for speech editing tasks, we select masked spans by comparing the original transcript and the target transcript to identify the words that should be masked out.
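A word-level diff is one straightforward way to do this span selection; here is a hedged sketch using Python's standard `difflib` (the paper does not specify its exact alignment method):

```python
import difflib

def masked_word_spans(original, target):
    """Align the original and target transcripts at the word level and
    return (start, end) word indices in the original that differ, i.e.
    the spans whose speech should be masked out before generation."""
    orig, tgt = original.split(), target.split()
    sm = difflib.SequenceMatcher(a=orig, b=tgt)
    return [(i1, i2) for op, i1, i2, _, _ in sm.get_opcodes() if op != "equal"]
```

For instance, deleting a filler word marks only that word's span: comparing "I found this um incredible model" against "I found this incredible model" flags word span (3, 4), i.e. "um".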
For zero-shot TTS evaluation, we randomly sampled 80 utterances from LibriTTS and YouTube datasets for human rating evaluations.
Quotes
"I found this um incredible model"
"We introduce VOICECRAFT"
"Our contributions are..."