Key Concepts
MultiVerse is a text-to-speech (TTS) system that achieves high-quality, zero-shot, multi-task synthesis across a range of conditions, including cross-lingual synthesis and speech style transfer, with significantly less training data than traditional data-driven approaches. It does so by combining source-filter theory-based disentanglement with a hybrid prosody modeling approach.
Statistics
MultiVerse achieves zero-shot synthesis comparable to VALL-E in both timbre and prosody while using only about 1/60 of VALL-E's training data.
MultiVerse was trained on approximately 1.2k hours of English and Korean speech data.
VALL-E was trained on over 60k hours of English speech data.
VALL-E X was trained on over 70k hours of English and Chinese speech data.
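To put these training-set sizes side by side, the short sketch below computes the data ratios from the rounded hour counts listed above. The dictionary and the `data_ratio` helper are illustrative names, not part of any MultiVerse code. Because the baseline figures are stated as "over" 60k and 70k hours, the resulting ratios should be read as rough upper bounds rather than exact values.

```python
from fractions import Fraction

# Approximate training-set sizes in hours, taken from the rounded figures above.
# The baseline values are lower bounds ("over 60k", "over 70k").
TRAINING_HOURS = {
    "MultiVerse": 1_200,
    "VALL-E": 60_000,
    "VALL-E X": 70_000,
}


def data_ratio(system: str, baseline: str) -> Fraction:
    """Return the training data of `system` as an exact fraction of `baseline`."""
    return Fraction(TRAINING_HOURS[system], TRAINING_HOURS[baseline])


if __name__ == "__main__":
    for baseline in ("VALL-E", "VALL-E X"):
        ratio = data_ratio("MultiVerse", baseline)
        print(f"MultiVerse / {baseline}: {ratio} (about {float(ratio):.1%})")
```

With these rounded inputs the ratio against VALL-E comes out at 1/50, the same order of magnitude as the roughly 1/60 figure quoted above; the gap is consistent with the hour counts being approximations.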
Quotes
"To expand TTS applications in zero-shot conditions, it is crucial to ensure generalization across various speech components, such as content, style, and speaker identity."
"In this paper, we introduce a multi-task TTS system, called MultiVerse, enabling speech synthesis and speech style transfer in zero-shot and cross-lingual conditions, requiring significantly less data compared to the data-driven approaches and featuring enhanced prosody modeling."
"Evaluation results demonstrate that MultiVerse not only achieves zero-shot TTS performance comparable to data-driven TTS systems with much less data, but also significantly outperforms other zero-shot TTS systems trained with the same small amount of data."