Turetzky, A., Shabtay, N., Shechtman, S., Haws, D., Aronowitz, H., Hoory, R., & Dekel, A. (2024). Continuous Speech Synthesis using per-token Latent Diffusion. arXiv preprint arXiv:2410.16048.
This paper investigates whether continuous speech representations, modeled with per-token latent diffusion, can compete with the traditional discrete representations that dominate zero-shot text-to-speech (TTS) synthesis.
The authors propose SALAD, a novel per-token latent diffusion model for zero-shot TTS, in three variants: Text2Acoustic (T2A), Semantic2Acoustic Autoregressive (S2A-AR), and Semantic2Acoustic Non-Autoregressive (S2A-NAR). For each variant they also train a matching baseline that operates on discrete tokens from Residual Vector Quantization (RVQ). All models are trained on the English subset of the Multilingual LibriSpeech dataset and evaluated on LibriSpeech test-clean using objective metrics (UTMOS, CER, speaker similarity) and subjective listening tests (MOS for speech quality and naturalness, a similarity score for speaker similarity).
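The paper itself does not include code; as a rough illustration of the per-token idea, the following is a minimal sketch of what a diffusion head attached to a language-model backbone might look like, assuming a standard DDPM-style noise-prediction objective. All names and shapes here (PerTokenDiffusionHead, latent_dim, the schedule) are hypothetical, not taken from the SALAD implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PerTokenDiffusionHead(nn.Module):
    """Hypothetical per-token diffusion head: conditioned on one transformer
    hidden state per token, a small MLP predicts the noise that was added to
    that token's continuous latent (DDPM-style noise prediction)."""

    def __init__(self, hidden_dim: int, latent_dim: int, num_steps: int = 1000):
        super().__init__()
        self.num_steps = num_steps
        self.step_embed = nn.Embedding(num_steps, hidden_dim)  # diffusion-step embedding
        self.mlp = nn.Sequential(
            nn.Linear(hidden_dim + latent_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, latent_dim),
        )

    def forward(self, cond: torch.Tensor, noisy_latent: torch.Tensor,
                t: torch.Tensor) -> torch.Tensor:
        # cond: (B, T, hidden_dim) backbone states; noisy_latent: (B, T, latent_dim)
        h = cond + self.step_embed(t).unsqueeze(1)              # broadcast step info over tokens
        return self.mlp(torch.cat([h, noisy_latent], dim=-1))   # predicted noise per token

def diffusion_loss(head: PerTokenDiffusionHead, cond: torch.Tensor,
                   latents: torch.Tensor, alpha_bars: torch.Tensor) -> torch.Tensor:
    """One training step of the assumed noise-prediction objective.
    alpha_bars: (num_steps,) cumulative products of (1 - beta_t) from the schedule."""
    b = cond.shape[0]
    t = torch.randint(0, head.num_steps, (b,), device=cond.device)
    noise = torch.randn_like(latents)
    ab = alpha_bars[t].view(b, 1, 1)                            # per-sample noise level
    noisy = ab.sqrt() * latents + (1 - ab).sqrt() * noise       # forward diffusion q(x_t | x_0)
    return F.mse_loss(head(cond, noisy, t), noise)
```

An RVQ baseline of the kind the paper compares against would replace this head with a stack of softmax classifiers, one per codebook level, predicting discrete code indices instead of continuous latents.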
The study suggests that continuous representation learning using per-token latent diffusion is a viable and competitive approach for zero-shot TTS, potentially outperforming traditional discrete methods in terms of intelligibility while maintaining comparable quality.
This research contributes to the advancement of TTS technology by exploring the potential of continuous representation learning, paving the way for more natural and intelligible synthetic speech.
Inference with the diffusion head is slower than with RVQ prediction heads, because each token's latent is produced by an iterative denoising loop rather than a single classification pass. Future research could focus on optimizing the inference speed of diffusion-based models and on developing quality metrics for diffusion processes to enable advanced decoding algorithms. The authors also suggest further exploration of multimodal models that operate on symmetric representations for both perception and generation.
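To see where the slowdown comes from, here is a minimal DDPM-style ancestral sampling loop over the hypothetical head sketched above; the linear beta schedule and all names are illustrative assumptions, not SALAD's actual decoding procedure. Each token's latent costs one network evaluation per diffusion step, whereas an RVQ head needs a single forward pass per codebook.

```python
import torch

@torch.no_grad()
def sample_latents(head, cond: torch.Tensor, latent_dim: int) -> torch.Tensor:
    """Illustrative ancestral sampling with the PerTokenDiffusionHead sketch:
    head.num_steps denoising iterations per token, vs. one classification
    pass for an RVQ prediction head."""
    b, seq_len = cond.shape[0], cond.shape[1]
    num_steps = head.num_steps
    betas = torch.linspace(1e-4, 0.02, num_steps, device=cond.device)  # assumed schedule
    alphas = 1.0 - betas
    alpha_bars = torch.cumprod(alphas, dim=0)
    x = torch.randn(b, seq_len, latent_dim, device=cond.device)        # start from pure noise
    for t in reversed(range(num_steps)):
        t_batch = torch.full((b,), t, dtype=torch.long, device=cond.device)
        eps = head(cond, x, t_batch)                                   # one network call per step
        coef = betas[t] / (1.0 - alpha_bars[t]).sqrt()
        x = (x - coef * eps) / alphas[t].sqrt()                        # DDPM posterior mean
        if t > 0:
            x = x + betas[t].sqrt() * torch.randn_like(x)              # inject sampling noise
    return x
```

The per-step cost is small (the head is only an MLP), but the loop multiplies it by the number of diffusion steps, which is the gap the authors flag; and because the loop yields no per-token scores comparable to RVQ logits, score-based decoding strategies do not apply directly, motivating the call for diffusion quality metrics.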