Core Concepts
Combining syntactic and acoustic cues improves pause prediction and placement, yielding more natural Korean text-to-speech synthesis even for longer and more complex sentences.
Summary
The paper proposes a novel framework, TaKOtron2-Pro, that integrates both syntactic and acoustic information to improve pause modeling for Korean text-to-speech (TTS) synthesis.
Key highlights:
- Syntactic features are extracted using a combination of local context modeling (CBHL) and global constituent parsing (NCP), capturing both local and global linguistic relationships.
- Acoustic features are learned in an unsupervised manner using a target acoustic embedding (TAE) predicted by the text encoder.
- The integration of syntactic and acoustic cues enables TaKOtron2-Pro to accurately and robustly insert pauses, even for longer and more complex sentences that are out-of-domain compared to the training data.
- Evaluations show that TaKOtron2-Pro significantly outperforms baseline TTS models in terms of mean opinion score (MOS), ABX preference, and word error rate (WER), especially for longer utterances.
- Ablation studies confirm the importance of leveraging both syntactic and acoustic information for effective pause modeling in Korean TTS.
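Of the metrics listed above, word error rate is the one with a simple closed-form definition: the word-level edit distance between a reference transcript and a hypothesis, divided by the number of reference words. The sketch below is a minimal illustration of that definition, not the paper's evaluation pipeline. It assumes whitespace tokenization (which for Korean yields eojeol units) and assumes the hypothesis is a transcript of the synthesized audio obtained separately; the function name `word_error_rate` is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Assumption: tokens are separated by whitespace. The summary does not
    say how the paper tokenizes or transcribes, so this is illustrative.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return prev[-1] / len(ref)
```

Because insertions count as errors, the value can exceed 1.0 when the hypothesis is much longer than the reference.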
Statistics
The Korean KSS dataset used for training has an average audio length of 2.38 seconds.
The longer, out-of-domain sentences used for evaluation average 22 words each.
Quotes
"Towards the enhancement of verbal fluency in synthetic Korean speech, we concentrate on refining the modeling of respiratory pauses."
"Remarkably, our framework possesses the capability to consistently generate natural speech even for considerably more extended and intricate out-of-domain (OOD) sentences, despite its training on short audio clips."
"Architectural design choices are validated through comparisons with baseline models and ablation studies using subjective and objective metrics, thus confirming model performance."