A fully end-to-end French text-to-speech synthesis system built on the VITS model with a HiFiGAN vocoder, covering data preprocessing, data augmentation, and evaluation in the Blizzard Challenge 2023.
This study introduces the first Kurdish TTS vocoder, trained on 21 hours of Kurdish speech data, a significant advance for Kurdish language technology. The researchers adapted the WaveGlow deep learning architecture to Kurdish, optimizing it for the language's acoustic properties to produce clear, natural output, and implemented prosody modeling techniques to improve the rhythm, stress, and intonation of the synthesized speech, which is crucial for lifelike quality.
A novel zero-shot text-to-speech model that utilizes multi-scale acoustic prompts, including a style prompt to capture personal speaking style and a timbre prompt to preserve the target speaker's voice characteristics, outperforming state-of-the-art language model-based approaches in terms of naturalness and speaker similarity.
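To make the multi-scale prompting idea concrete, here is a minimal PyTorch sketch: a frame-level encoder summarizes the style prompt, a pooled encoder summarizes the timbre prompt, and their concatenation conditions a decoder stand-in. Module names and shapes are hypothetical, not the paper's architecture.

    import torch
    import torch.nn as nn

    class MultiScalePromptTTS(nn.Module):
        """Sketch: condition a decoder on two acoustic prompts at different scales."""
        def __init__(self, n_mels=80, d_model=256):
            super().__init__()
            # Frame-level style encoder: keeps the time axis to capture
            # speaking style (prosody) at a fine scale.
            self.style_encoder = nn.GRU(n_mels, d_model, batch_first=True)
            # Utterance-level timbre encoder: mean-pools to one vector that
            # preserves the target speaker's voice characteristics.
            self.timbre_encoder = nn.Sequential(nn.Linear(n_mels, d_model), nn.ReLU())
            self.decoder = nn.Linear(2 * d_model, n_mels)  # stand-in for a real decoder

        def forward(self, style_prompt, timbre_prompt, text_len):
            # style_prompt, timbre_prompt: (batch, frames, n_mels)
            style_seq, _ = self.style_encoder(style_prompt)
            style = style_seq[:, -1]                              # style summary vector
            timbre = self.timbre_encoder(timbre_prompt).mean(dim=1)  # pooled timbre vector
            cond = torch.cat([style, timbre], dim=-1)             # multi-scale condition
            # Broadcast the condition over the output frames (decoder stand-in).
            return self.decoder(cond).unsqueeze(1).expand(-1, text_len, -1)

    model = MultiScalePromptTTS()
    mel = model(torch.randn(2, 120, 80), torch.randn(2, 300, 80), text_len=50)
    print(mel.shape)  # torch.Size([2, 50, 80])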
RALL-E improves the robustness of large language model-based text-to-speech synthesis by incorporating prosody tokens as chain-of-thought prompting and using duration-guided masking to enhance the alignment between phonemes and speech tokens.
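A minimal sketch of the duration-guided masking idea, assuming per-phoneme frame durations are available from a predictor; the windowing rule is illustrative, not RALL-E's exact formulation.

    import torch

    def duration_guided_mask(durations, window=1):
        """Build a (speech_frames x phonemes) boolean attention mask: each speech
        token may only attend to the phoneme its frame is aligned to (derived
        from cumulative durations), plus `window` neighbors on each side.
        `durations` is a 1-D integer tensor of per-phoneme frame counts."""
        n_phones = durations.numel()
        # Phoneme index of every speech frame, expanded from the durations.
        frame_to_phone = torch.repeat_interleave(torch.arange(n_phones), durations)
        phone_idx = torch.arange(n_phones).unsqueeze(0)   # (1, n_phones)
        center = frame_to_phone.unsqueeze(1)              # (n_frames, 1)
        return (phone_idx - center).abs() <= window       # (n_frames, n_phones)

    print(duration_guided_mask(torch.tensor([2, 3, 1])).int())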
Leveraging the interplay between syntactic and acoustic cues to enhance pause prediction and placement for more natural Korean text-to-speech synthesis, even for longer and more complex sentences.
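A toy sketch of fusing syntactic and acoustic cues for pause prediction; the specific features here (POS tag, parse depth, pre-boundary phone duration) are illustrative stand-ins, not the paper's inputs.

    import torch
    import torch.nn as nn

    class PausePredictor(nn.Module):
        """Sketch: predict an inter-word pause from syntactic cues fused with
        acoustic cues. Feature choices are illustrative only."""
        def __init__(self, n_pos=32, d=32):
            super().__init__()
            self.pos_embed = nn.Embedding(n_pos, d)      # syntactic cue: POS tag
            self.fuse = nn.Sequential(
                nn.Linear(d + 2, d), nn.ReLU(),          # +2: parse depth, phone duration
                nn.Linear(d, 1),                         # logit: pause / no pause
            )

        def forward(self, pos_id, parse_depth, phone_dur):
            x = torch.cat([self.pos_embed(pos_id),
                           parse_depth.unsqueeze(-1),
                           phone_dur.unsqueeze(-1)], dim=-1)
            return self.fuse(x).squeeze(-1)

    p = PausePredictor()
    logits = p(torch.tensor([3, 7]), torch.tensor([2.0, 4.0]), torch.tensor([0.08, 0.21]))
    print(torch.sigmoid(logits))  # pause probability per word boundary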
A novel speech synthesis pipeline that generates emotional and disfluent speech patterns in a zero-shot manner using a large language model, enabling more natural and relatable interactions for conversational AI systems.
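A minimal sketch of the zero-shot idea: an LLM rewrites clean text with emotion-appropriate disfluencies before it reaches the TTS front end. The prompt wording and the `llm` callable are placeholders, not the paper's pipeline.

    def disfluent_rewrite(llm, text, emotion="nervous"):
        """Sketch: ask an LLM to inject emotion-appropriate disfluencies
        (fillers, repetitions, pause markers) into clean text before TTS.
        `llm` is any text-completion callable; the prompt is illustrative."""
        prompt = (
            f"Rewrite the sentence below as it would be spoken by a {emotion} "
            "person. Insert natural disfluencies such as 'um', 'uh', word "
            "repetitions, and pause markers <breath>, but keep the meaning.\n"
            f"Sentence: {text}\nSpoken version:"
        )
        return llm(prompt)

    # Example with a trivial stand-in LLM:
    print(disfluent_rewrite(lambda p: "Um, I... I think we should, uh, wait.",
                            "I think we should wait."))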
CM-TTS, a novel architecture based on consistency models, achieves high-quality speech synthesis in fewer steps without adversarial training or pre-trained model dependencies. Weighted samplers are introduced to mitigate biases during model training.
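A sketch of few-step sampling as used by consistency models in general, assuming a trained consistency function f(x, sigma); this is the generic multistep procedure, not CM-TTS's exact sampler.

    import torch

    def consistency_sample(f, sigmas, shape):
        """Few-step consistency sampling sketch. `f(x, sigma)` maps a noisy
        mel-spectrogram at noise level sigma directly to a clean estimate;
        `sigmas` is a short decreasing noise schedule (e.g. 2-4 levels
        instead of dozens of diffusion steps)."""
        x = sigmas[0] * torch.randn(shape)   # start from pure noise
        x = f(x, sigmas[0])                  # one-shot clean estimate
        for sigma in sigmas[1:]:
            # Re-noise the estimate to the next (lower) level, then jump
            # back to the data manifold with one more function evaluation.
            x = f(x + sigma * torch.randn(shape), sigma)
        return x

    # Toy stand-in for a trained consistency model (simple shrinkage denoiser).
    mel = consistency_sample(lambda x, s: x / (1 + s), [2.0, 0.5], (1, 80, 100))
    print(mel.shape)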
LipVoicer is a novel method that generates high-quality and intelligible speech from silent videos by leveraging a lip-reading model to guide a diffusion-based speech generation model.
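A classifier-guidance-style sketch of the idea: the gradient of a lip-reading model's score steers each diffusion denoising step toward speech consistent with the video. The update rule mirrors standard classifier guidance and is not LipVoicer's exact formulation.

    import torch

    def guided_denoise_step(eps_model, lipread_logprob, x_t, t, video, scale=1.0):
        """Sketch: steer a diffusion speech model with a lip-reading model.
        `eps_model(x_t, t)` predicts noise; `lipread_logprob(x, video)` scores
        how well audio x matches the mouthed text in the silent video."""
        x_t = x_t.detach().requires_grad_(True)
        # Gradient of the lip-reading score w.r.t. the noisy audio pushes the
        # sample toward speech that is consistent with the video.
        lipread_logprob(x_t, video).sum().backward()
        guidance = x_t.grad
        eps = eps_model(x_t.detach(), t)
        # Shift the predicted noise against the guidance gradient.
        return eps - scale * guidance

    # Toy stand-ins for the two models:
    eps_hat = guided_denoise_step(lambda x, t: torch.zeros_like(x),
                                  lambda x, v: -(x ** 2).mean(),
                                  torch.randn(1, 80, 50), t=10, video=None)
    print(eps_hat.shape)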
AugCondD, an augmentation-conditional discriminator for GAN-based speech synthesis, improves speech quality when training data is limited while maintaining comparable quality when sufficient data is available.
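Reading AugCondD as an augmentation-conditional discriminator, here is a minimal sketch: the discriminator receives the augmentation state alongside the waveform, so augmented real speech is judged relative to its augmentation rather than penalized as unnatural. The network and shapes are hypothetical, not the paper's.

    import torch
    import torch.nn as nn

    class AugCondDiscriminator(nn.Module):
        """Sketch of an augmentation-conditional waveform discriminator."""
        def __init__(self, n_aug=4, d=64):
            super().__init__()
            self.feat = nn.Sequential(
                nn.Conv1d(1, d, kernel_size=15, stride=4, padding=7), nn.LeakyReLU(0.2),
                nn.Conv1d(d, d, kernel_size=15, stride=4, padding=7), nn.LeakyReLU(0.2),
            )
            self.aug_embed = nn.Embedding(n_aug, d)  # which augmentation was applied
            self.out = nn.Conv1d(d, 1, kernel_size=3, padding=1)

        def forward(self, wav, aug_id):
            # wav: (batch, 1, samples); aug_id: (batch,) augmentation index (0 = none)
            h = self.feat(wav)
            h = h + self.aug_embed(aug_id).unsqueeze(-1)  # inject the augmentation condition
            return self.out(h)                            # per-frame real/fake logits

    d = AugCondDiscriminator()
    logits = d(torch.randn(2, 1, 16000), torch.tensor([0, 3]))
    print(logits.shape)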
Zero-shot TTS systems are limited by their prompting mechanisms; Mega-TTS 2 introduces a generic prompting mechanism that tackles these challenges effectively.