The paper introduces CLaM-TTS, a novel approach to zero-shot text-to-speech (TTS) synthesis that leverages neural codec language modeling. The key insights are:
Compression in Token Length: CLaM-TTS uses probabilistic residual vector quantization (RVQ) to sharply compress the token length of the speech representation, addressing the scalability challenge posed by long sequences and the complexity of modeling multiple token streams in neural audio codecs.
Efficient Token Generation: The proposed method lets the language model generate multiple tokens at once by predicting a continuous latent representation that is then converted to discrete tokens with the learned RVQ. This removes the need for cascaded models to handle multiple token streams, further improving efficiency.
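To make the RVQ idea above concrete, here is a minimal sketch of plain (non-probabilistic) residual vector quantization: each stage quantizes the residual left by the previous stage, so one continuous vector maps to a short stack of discrete codes. The codebook count, size, and dimension are illustrative assumptions, not CLaM-TTS's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

def rvq_encode(x, codebooks):
    """Quantize vector x with a stack of codebooks; each stage
    picks the nearest code to the residual of the previous stage."""
    residual = x.copy()
    indices = []
    for cb in codebooks:  # cb has shape (codebook_size, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct by summing the selected code vectors."""
    return sum(cb[i] for i, cb in zip(indices, codebooks))

# Illustrative sizes only (assumption, not the paper's setup).
dim, size, depth = 8, 16, 4
codebooks = [rng.normal(size=(size, dim)) for _ in range(depth)]
x = rng.normal(size=dim)
codes = rvq_encode(x, codebooks)      # depth discrete tokens per frame
x_hat = rvq_decode(codes, codebooks)  # continuous reconstruction
```

The point of the sketch is the token economy: a `dim`-dimensional frame becomes only `depth` integers, and generating the continuous latent first (as the summary describes) lets all `depth` codes for a frame be produced in one step rather than one stream at a time.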
The authors train the Mel-VAE model using the proposed RVQ, together with the latent language model, on a large-scale dataset of 100K hours of speech spanning 11 languages. Experimental results show that CLaM-TTS matches or outperforms state-of-the-art neural-codec-based TTS models in naturalness, intelligibility, speaker similarity, and inference speed. The authors also investigate how the extent of language-model pretraining and the choice of text tokenization strategy affect TTS performance.