Improving Zero-Shot Text-to-Speech Synthesis with Multi-Scale Acoustic Prompts
A zero-shot text-to-speech model that uses multi-scale acoustic prompts, including a style prompt to capture the target speaker's personal speaking style and a timbre prompt to preserve the speaker's voice characteristics. It outperforms state-of-the-art language model-based approaches in naturalness and speaker similarity.