Key Concepts
Integrating signal processing cues with deep learning techniques can produce accurate phone alignments, leading to better duration modeling and higher-quality text-to-speech synthesis for Indian languages.
Summary
This paper presents a method for building high-quality end-to-end (E2E) text-to-speech (TTS) systems for 13 Indian languages by combining signal processing cues with deep learning techniques. The focus is on correcting the phone alignments of the training data, which improves duration prediction and, in turn, the synthesis quality of the E2E TTS systems.
The authors use the FastSpeech2 architecture as the mel-spectrogram generation model and the HiFi-GAN vocoder for speech reconstruction. They compare the performance of systems trained with different alignment techniques: a teacher model, Montreal Forced Aligner (MFA), and a hybrid HMM-GD-DNN segmentation (HS) approach that combines signal processing cues and deep learning.
Experiments on the Hindi male dataset show that the HS-based system outperforms the other alignment approaches, especially in low-resource scenarios. The authors also evaluate the proposed systems against the existing best TTS systems available for 13 Indian languages and find that the HS-based systems perform better on average, with a 62.63% preference from native listeners.
The key highlights are:
- Accurate phone alignments obtained using signal processing cues in tandem with deep learning lead to better duration modeling and higher-quality synthesis.
- The HS-based FastSpeech2 system outperforms systems trained with teacher model and MFA alignments, especially in low-resource scenarios.
- The proposed systems perform better than the existing state-of-the-art TTS systems for 13 Indian languages on average.
Statistics
The average absolute boundary difference between manual alignments and those obtained using MFA is 11.88 ms, while for the HS approach it is 4.40 ms.
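The boundary-difference figure above can be illustrated with a minimal sketch. It assumes the manual and automatic alignments contain the same phone boundaries in the same order, with times given in seconds; the function name and the sample values are hypothetical and are not taken from the paper.

```python
def mean_abs_boundary_diff_ms(reference_s, predicted_s):
    """Average absolute difference between paired boundaries, in milliseconds.

    Both inputs are sequences of boundary times in seconds (an assumption);
    boundary i in one list must correspond to boundary i in the other.
    """
    if len(reference_s) != len(predicted_s):
        raise ValueError("alignments must contain the same number of boundaries")
    if not reference_s:
        raise ValueError("at least one boundary is required")
    total_s = sum(abs(r - p) for r, p in zip(reference_s, predicted_s))
    return 1000.0 * total_s / len(reference_s)


# Hypothetical boundaries for a three-phone utterance (illustrative only).
manual = [0.100, 0.250, 0.400]
automatic = [0.105, 0.240, 0.400]
diff_ms = mean_abs_boundary_diff_ms(manual, automatic)
```

A lower value means the automatic boundaries sit closer to the manual ones, which is the sense in which 4.40 ms for HS is better than 11.88 ms for MFA.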
The mel-cepstral distortion (MCD) score, where lower is better, is 6.58 for the HS-based system, comparable to 6.56 for the teacher model and 6.61 for the MFA system when trained on the full Hindi male dataset.
In the low-resource scenario with 1 hour of data, the MCD scores are 7.21 for HS, 7.30 for MFA, and 6.95 for the VITS model.
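For readers unfamiliar with MCD, the sketch below shows one commonly used formulation: per frame, (10 / ln 10) * sqrt(2 * sum of squared coefficient differences), averaged over frames. It assumes the reference and synthesized frames are already time-aligned and that the energy (0th) coefficient has been dropped. The summary does not state which variant or settings the authors used, so this is illustrative and not their evaluation code; the function name is hypothetical.

```python
import math


def mel_cepstral_distortion(ref_frames, syn_frames):
    """Mean MCD in dB over paired frames of mel-cepstral coefficients.

    Assumes the two sequences are time-aligned frame by frame and that
    every frame has the same number of coefficients.
    """
    if len(ref_frames) != len(syn_frames) or not ref_frames:
        raise ValueError("need two non-empty, equally long frame sequences")
    scale = (10.0 / math.log(10.0)) * math.sqrt(2.0)
    total = 0.0
    for ref, syn in zip(ref_frames, syn_frames):
        if len(ref) != len(syn):
            raise ValueError("frames must have the same number of coefficients")
        # Euclidean distance between the two coefficient vectors.
        total += scale * math.sqrt(sum((r - s) ** 2 for r, s in zip(ref, syn)))
    return total / len(ref_frames)
```

Identical inputs give 0 dB, and larger spectral differences give larger values, which is why lower MCD indicates synthesis closer to the reference recording.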
Quotes
"Integrating signal processing cues with deep learning techniques can produce accurate phone alignments, leading to better duration modeling and higher-quality text-to-speech synthesis for Indian languages."
"The HS-based FastSpeech2 system outperforms systems trained with teacher model and MFA alignments, especially in low-resource scenarios."
"The proposed systems perform better than the existing state-of-the-art TTS systems for 13 Indian languages on average."