The paper introduces IndicVoices-R, a large-scale multilingual text-to-speech (TTS) dataset for Indian languages. The dataset is derived from the existing IndicVoices automatic speech recognition (ASR) corpus, which covers 22 Indian languages and includes both read-speech and conversational recordings.
To enhance the quality of the IndicVoices dataset for TTS, the authors employ a comprehensive data pipeline that involves demixing, dereverberation, and denoising of the audio samples. This process results in IndicVoices-R, a dataset that matches the speech and sound quality of gold-standard TTS datasets like LJSpeech, LibriTTS, and IndicTTS, as measured by various speech quality metrics.
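This summary does not name the specific tools behind each stage, but the overall shape of such a restoration pass can be sketched as follows. Only the torchaudio I/O calls are real; the `demix`, `dereverberate`, and `denoise` stages are identity placeholders standing in for whatever separation and enhancement models the authors actually use.

```python
# Sketch of a restore-then-save pass over one ASR recording, assuming the
# three-stage pipeline (demixing -> dereverberation -> denoising) described
# above. The stage functions are placeholders, not the authors' models.
import torch
import torchaudio

def demix(wav: torch.Tensor, sr: int) -> torch.Tensor:
    # Placeholder: a real implementation would run a source-separation
    # model and keep only the vocal stem.
    return wav

def dereverberate(wav: torch.Tensor, sr: int) -> torch.Tensor:
    # Placeholder: a real implementation would suppress room reverberation.
    return wav

def denoise(wav: torch.Tensor, sr: int) -> torch.Tensor:
    # Placeholder: a real implementation would remove residual background noise.
    return wav

def restore(in_path: str, out_path: str, target_sr: int = 22050) -> None:
    wav, sr = torchaudio.load(in_path)                         # raw ASR audio
    wav = torchaudio.functional.resample(wav, sr, target_sr)   # unify sampling rate
    for stage in (demix, dereverberate, denoise):              # apply stages in order
        wav = stage(wav, target_sr)
    torchaudio.save(out_path, wav, target_sr)
```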
IndicVoices-R surpasses existing Indian TTS datasets in scale, with 1,704 hours of speech from 10,496 speakers. It also exhibits rich speaker diversity across age, gender, and speaking style, which is crucial for good cross-speaker generalization in TTS models.
To complement the dataset, the authors introduce the IndicVoices-R Benchmark, a carefully designed evaluation framework to assess the zero-shot, few-shot, and many-shot speaker generalization capabilities of TTS models on Indian voices. They demonstrate that fine-tuning an English pre-trained model on IndicVoices-R leads to better zero-shot speaker generalization compared to fine-tuning on the IndicTTS dataset alone.
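As an illustration of how such speaker-generalization conditions can be constructed, the sketch below partitions speakers into zero-shot (never seen in training), few-shot (only a handful of training utterances), and many-shot (ample training utterances) groups and holds out evaluation utterances for each. The record format, group proportions, and budgets are assumptions made for this sketch, not the benchmark's actual protocol.

```python
# Illustrative construction of zero-/few-/many-shot speaker conditions,
# assuming per-utterance records that carry a "speaker_id" field.
import random
from collections import defaultdict

def make_splits(utterances, few_shot_budget=5, eval_per_speaker=2, seed=0):
    by_speaker = defaultdict(list)
    for utt in utterances:
        by_speaker[utt["speaker_id"]].append(utt)

    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)
    n = len(speakers)
    groups = {
        "zero_shot": speakers[: n // 3],
        "few_shot": speakers[n // 3 : 2 * n // 3],
        "many_shot": speakers[2 * n // 3 :],
    }

    train, evals = [], {cond: [] for cond in groups}
    for cond, spks in groups.items():
        for spk in spks:
            utts = by_speaker[spk]
            held_out, rest = utts[:eval_per_speaker], utts[eval_per_speaker:]
            evals[cond].extend(held_out)                 # always evaluate on held-out utterances
            if cond == "few_shot":
                train.extend(rest[:few_shot_budget])     # only a few utterances seen in training
            elif cond == "many_shot":
                train.extend(rest)                       # all remaining utterances seen in training
            # zero_shot speakers contribute nothing to training
    return train, evals
```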
The authors open-source the IndicVoices-R dataset and the first TTS model that supports all 22 official Indian languages, paving the way for the development of more robust and versatile TTS systems for Indian languages.