Li, Y., Xie, R., Sun, X., Cheng, Y., & Kang, Z. (2024). Continuous Speech Tokenizer in Text To Speech. arXiv preprint arXiv:2410.17081.
This paper investigates the limitations of discrete speech tokenization in text-to-speech (TTS) systems and proposes a novel approach using continuous speech tokens to improve speech synthesis quality and robustness.
The authors develop a continuous speech tokenizer that replaces the traditional Residual Vector Quantization (RVQ) with an embedding module that produces continuous representations of speech tokens. They integrate this tokenizer into a TTS model built on an autoregressive language model and use optimal-transport conditional flow matching (OT-CFM) to generate mel-spectrograms from the continuous speech features. The model is trained in two stages: the tokenizer is first pre-trained with a VAE-like objective, and the entire model is then trained jointly with an emphasis on language modeling.
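The two components described above can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the projection matrices, shapes, and function names are hypothetical, and the OT-CFM path uses the simplest setting (sigma_min = 0), under which the conditional path is a straight line and the regression target is the constant velocity x1 - x0.

```python
import numpy as np

rng = np.random.default_rng(0)

def continuous_tokenize(frame_features, W_mu, W_logvar):
    """VAE-style continuous tokenizer head (sketch): instead of snapping
    encoder features to a discrete RVQ codebook entry, project them to a
    mean and log-variance and sample a continuous token via the
    reparameterization trick. W_mu / W_logvar are hypothetical weights."""
    mu = frame_features @ W_mu
    logvar = frame_features @ W_logvar
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps  # continuous token, no codebook lookup

def ot_cfm_pair(x0, x1, t):
    """OT conditional flow matching with sigma_min = 0 (sketch): the
    straight-line path x_t = (1 - t) * x0 + t * x1 has constant target
    velocity x1 - x0, which the mel-spectrogram decoder regresses."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

# Toy shapes: 4 frames of 8-dim encoder features mapped to 16-dim tokens.
feats = rng.standard_normal((4, 8))
W_mu = rng.standard_normal((8, 16))
W_logvar = rng.standard_normal((8, 16))
tokens = continuous_tokenize(feats, W_mu, W_logvar)

x0 = rng.standard_normal((4, 80))   # noise sample
x1 = rng.standard_normal((4, 80))   # target mel-spectrogram frames
x_t, v = ot_cfm_pair(x0, x1, 0.3)
print(tokens.shape, x_t.shape)      # (4, 16) (4, 80)
```

The key contrast with a discrete tokenizer is that `continuous_tokenize` never rounds to a finite codebook, so no quantization error is introduced between the encoder and the language model.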
The study demonstrates the effectiveness of continuous speech tokenization in enhancing TTS systems. By preserving more audio information, the proposed approach achieves higher speech quality, continuity, and robustness compared to traditional discrete methods.
This research contributes to the advancement of TTS technology by introducing a novel and effective approach for speech representation. The findings have implications for developing more natural-sounding and robust TTS systems for various applications.
The study focuses primarily on TTS tasks and has not been evaluated on Multimodal Large Language Models (MLLMs). Future research could explore applying continuous speech tokenization to MLLMs and address potential challenges related to training complexity and information overload.