RALL-E: A Robust Codec Language Model with Chain-of-Thought Prompting for Improved Text-to-Speech Synthesis
RALL-E improves the robustness of large language model (LLM)-based text-to-speech synthesis in two ways: it uses prosody tokens as chain-of-thought prompts, and it applies duration-guided masking to strengthen the alignment between phonemes and speech tokens.
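The summary does not spell out how duration-guided masking works, so the following is only a minimal sketch of one plausible reading. It assumes that each phoneme has a known duration in speech frames and that every speech frame may attend only to the phonemes within a small window around the phoneme it is aligned to. The function name `duration_guided_mask` and the `window` parameter are hypothetical and chosen for illustration; they are not taken from the source.

```python
def duration_guided_mask(durations, window=1):
    """Build a boolean attention mask from per-phoneme durations.

    durations: number of speech frames assigned to each phoneme (assumed given).
    window:    how many neighbouring phonemes on each side stay visible
               (hypothetical parameter, for illustration only).

    Returns mask[t][p], True when speech frame t may attend to phoneme p.
    """
    n_phonemes = len(durations)
    mask = []
    for center, dur in enumerate(durations):
        # Phonemes visible to every frame aligned with phoneme `center`.
        lo = max(0, center - window)
        hi = min(n_phonemes, center + window + 1)
        row = [lo <= p < hi for p in range(n_phonemes)]
        # One identical row per speech frame belonging to this phoneme.
        mask.extend(list(row) for _ in range(dur))
    return mask


if __name__ == "__main__":
    # Three phonemes lasting 2, 1, and 3 frames respectively.
    for row in duration_guided_mask([2, 1, 3], window=1):
        print("".join("x" if visible else "." for visible in row))
```

In an attention layer, a mask of this shape would typically be applied to the phoneme positions so that masked entries receive no attention weight. With `window=0` each frame sees only its own phoneme, while larger windows let it see some neighbouring context.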