A novel zero-shot text-to-speech model that utilizes multi-scale acoustic prompts, including a style prompt to capture personal speaking style and a timbre prompt to preserve the target speaker's voice characteristics, outperforming state-of-the-art language model-based approaches in terms of naturalness and speaker similarity.
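For intuition, here is a minimal PyTorch sketch of how two prompts at different scales might condition a TTS model: the timbre prompt pooled into a single utterance-level vector, and the style prompt kept as a frame-level sequence that the decoder attends into. All module choices, dimensions, and names (`MultiScalePromptTTS`, etc.) are illustrative assumptions, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class MultiScalePromptTTS(nn.Module):
    """Sketch: condition synthesis on two prompts at different scales --
    a global timbre vector and a frame-level style sequence."""

    def __init__(self, d_model=256, n_mels=80):
        super().__init__()
        self.text_encoder = nn.GRU(d_model, d_model, batch_first=True)
        self.timbre_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        self.style_encoder = nn.GRU(n_mels, d_model, batch_first=True)
        self.style_attn = nn.MultiheadAttention(d_model, 4, batch_first=True)
        self.decoder = nn.Linear(d_model, n_mels)  # stand-in for the real decoder

    def forward(self, text_emb, style_mel, timbre_mel):
        h_text, _ = self.text_encoder(text_emb)
        # Timbre: pool the whole prompt into one utterance-level vector.
        _, h_timbre = self.timbre_encoder(timbre_mel)
        h_text = h_text + h_timbre[-1].unsqueeze(1)
        # Style: keep the frame-level sequence and attend into it,
        # so local prosodic patterns can be transferred.
        h_style, _ = self.style_encoder(style_mel)
        h_attn, _ = self.style_attn(h_text, h_style, h_style)
        return self.decoder(h_text + h_attn)

# model = MultiScalePromptTTS()
# out = model(torch.randn(1, 12, 256), torch.randn(1, 40, 80), torch.randn(1, 60, 80))
```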
RALL-E improves the robustness of large language model-based text-to-speech synthesis by incorporating prosody tokens as chain-of-thought prompting and using duration-guided masking to enhance the alignment between phonemes and speech tokens.
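The duration-guided masking idea can be sketched as building an attention mask from predicted phoneme durations, so that each speech token attends only to phonemes near its aligned position. The function below is a hedged illustration; `duration_guided_mask` and the `window` parameter are hypothetical, not RALL-E's implementation.

```python
import torch

def duration_guided_mask(durations, window=1):
    """Given predicted phoneme durations (in speech tokens), allow each
    speech token to attend only to its aligned phoneme plus `window`
    neighbouring phonemes on each side."""
    n_phones = durations.numel()
    n_tokens = int(durations.sum())
    # Map each speech-token position to its phoneme index.
    phone_of_token = torch.repeat_interleave(torch.arange(n_phones), durations)
    mask = torch.zeros(n_tokens, n_phones, dtype=torch.bool)
    for t, p in enumerate(phone_of_token.tolist()):
        lo, hi = max(0, p - window), min(n_phones, p + window + 1)
        mask[t, lo:hi] = True
    return mask  # True = attention allowed

# e.g. three phonemes lasting 2, 1 and 3 speech tokens respectively
print(duration_guided_mask(torch.tensor([2, 1, 3])))
```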
Leveraging the interplay between syntactic and acoustic cues to enhance pause prediction and placement for more natural Korean text-to-speech synthesis, even for longer and more complex sentences.
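A minimal sketch of how such a pause predictor could look: a boundary classifier that fuses syntactic cues (e.g., POS or dependency features from a Korean morphological analyzer) with acoustic cues at each candidate pause position. Feature dimensions and names are assumptions, not the paper's model.

```python
import torch
import torch.nn as nn

class PausePredictor(nn.Module):
    """Sketch: classify each word boundary as {no pause, pause} from
    concatenated syntactic and acoustic feature vectors."""

    def __init__(self, n_syntax=32, n_acoustic=16, hidden=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(n_syntax + n_acoustic, hidden),
            nn.ReLU(),
            nn.Linear(hidden, 2),  # logits for {no pause, pause}
        )

    def forward(self, syntax_feats, acoustic_feats):
        return self.net(torch.cat([syntax_feats, acoustic_feats], dim=-1))
```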
A novel speech synthesis pipeline that generates emotional and disfluent speech patterns in a zero-shot manner using a large language model, enabling more natural and relatable interactions for conversational AI systems.
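One plausible shape for such a pipeline is a zero-shot LLM rewriting step followed by TTS: the LLM inserts fillers, hesitations, and emotion markup, and the tagged text is then synthesized. The prompt wording, the tag format, and the `add_disfluencies`/`llm`/`tts` names below are hypothetical.

```python
def add_disfluencies(llm, text, emotion):
    """Sketch of the zero-shot step: ask an LLM to rewrite plain text as
    spontaneous speech with fillers and emotion tags that a downstream
    TTS model can realize. `llm` is any callable prompt -> completion."""
    prompt = (
        f"Rewrite the sentence below as spontaneous {emotion} speech. "
        "Insert natural fillers ('um', 'uh'), hesitations and repetitions, "
        "and wrap the result in <emotion> tags.\n"
        f"Sentence: {text}"
    )
    return llm(prompt)

# tagged = add_disfluencies(my_llm, "The meeting moved to Friday.", "nervous")
# audio = tts.synthesize(tagged)   # hypothetical TTS call
```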
CM-TTS, a novel architecture based on consistency models, achieves high-quality speech synthesis in fewer sampling steps without adversarial training or dependence on pre-trained models; weighted samplers are introduced to mitigate biases during training.
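Weighted timestep sampling can be sketched as drawing training timesteps from a non-uniform distribution while keeping the sampling probabilities for importance re-weighting. The weight schedule and function below are illustrative assumptions, not CM-TTS's exact sampler.

```python
import torch

def sample_timesteps(weights, batch_size):
    """Sketch: draw consistency-training timesteps from a weighted
    distribution so under-trained regions are visited more often.
    Returns the sampled indices and their probabilities, which can be
    used to re-weight the per-sample loss."""
    probs = weights / weights.sum()
    idx = torch.multinomial(probs, batch_size, replacement=True)
    return idx, probs[idx]

# e.g. emphasize early (noisier) steps on an 18-step discretization
w = torch.linspace(2.0, 1.0, steps=18)
t, p = sample_timesteps(w, batch_size=8)
```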
LipVoicer is a novel method that generates high-quality and intelligible speech from silent videos by leveraging a lip-reading model to guide a diffusion-based speech generation model.
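The guidance idea resembles classifier guidance: at each reverse-diffusion step, the sample is nudged toward audio whose recognized content matches the transcript predicted from the lips. Below is a hedged sketch; `diffusion.step`, `asr_loss`, and the guidance scale are hypothetical interfaces, not LipVoicer's actual API.

```python
import torch

def guided_denoise_step(x_t, t, diffusion, asr_loss, scale=2.0):
    """Sketch: shift one reverse-diffusion step against the gradient of a
    mismatch loss between the sample's ASR output and the lip-read text."""
    x_t = x_t.detach().requires_grad_(True)
    loss = asr_loss(x_t, t)              # scalar mismatch w.r.t. lip-read text
    grad = torch.autograd.grad(loss, x_t)[0]
    mean, std = diffusion.step(x_t, t)   # unguided reverse-step statistics
    # Lowering the mismatch loss raises the likelihood of the target text.
    return mean - scale * std**2 * grad + std * torch.randn_like(mean)
```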
AugCondD, an augmentation-conditional discriminator, improves speech quality when training data is limited while maintaining comparable quality when data is sufficient.
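The core idea can be sketched as a discriminator that also receives the augmentation state, so it judges "real vs. fake given this augmentation" rather than mistaking augmentation artifacts for fakeness. All layer sizes and names below are illustrative, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AugCondDiscriminator(nn.Module):
    """Sketch: a mel-spectrogram discriminator conditioned on an
    augmentation identifier, injected as a learned embedding."""

    def __init__(self, n_mels=80, n_aug=4, hidden=128):
        super().__init__()
        self.aug_embed = nn.Embedding(n_aug, hidden)
        self.frontend = nn.Conv1d(n_mels, hidden, kernel_size=3, padding=1)
        self.head = nn.Linear(hidden, 1)

    def forward(self, mel, aug_id):
        h = self.frontend(mel).mean(dim=-1)  # (B, hidden) after time pooling
        h = h + self.aug_embed(aug_id)       # inject the augmentation state
        return self.head(h)                  # real/fake score
```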
Zero-shot TTS systems face challenges in their prompting mechanisms; Mega-TTS 2 introduces a generic prompting mechanism that tackles these challenges effectively.
The proposed EM-TTS model substantially reduces training time and model parameters while maintaining a consistent level of synthesis quality.
Both the characteristics of the prompt and those of the content to be synthesized significantly impact speech synthesis quality.