
Introducing ManaTTS: A Large Open-Source Persian Text-to-Speech Dataset and Processing Pipeline


Core Concepts
This work introduces ManaTTS, the largest publicly available single-speaker Persian text-to-speech dataset, along with a comprehensive open-source processing pipeline for creating high-quality speech datasets for low-resource languages.
Abstract
This paper presents the ManaTTS dataset, the largest publicly accessible single-speaker Persian text-to-speech corpus, comprising approximately 86 hours of audio. The dataset was created by crawling and processing content from the Nasl-e-Mana magazine, a publication focused on the blind community. The key highlights of this work include:

- The dataset is distributed under the open CC-0 1.0 license, enabling free educational and commercial use.
- The audio files have a sampling rate of 44.1 kHz and were recorded in a silent environment with minimal background noise.
- The dataset covers a diverse range of topics and includes 24,113 unique words.
- The authors developed a fully transparent, MIT-licensed processing pipeline that includes novel tools for sentence tokenization, audio segmentation, and a custom forced alignment method designed for low-resource languages.
- The forced alignment tool utilizes multiple open-source Persian automatic speech recognition (ASR) models to generate reliable hypothesis transcripts, which are then matched to the ground-truth text.
- Experiments demonstrate the effectiveness of the ManaTTS dataset: a Tacotron2-based TTS model trained on it achieves a Mean Opinion Score (MOS) of 3.76, remarkably close to the MOS of 3.86 for natural speech spectrograms and 4.01 for natural waveforms.
- The authors also released the VirgoolInformal dataset, a smaller transcribed speech corpus used to evaluate and prioritize the ASR models employed in the forced alignment process.
- The entire processing pipeline, including data crawling scripts and model training code, is made publicly available, ensuring the reproducibility of the results.
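The idea of matching ASR hypotheses to ground-truth text can be illustrated with a toy sketch. The paper does not specify how the multiple hypothesis transcripts are combined, so the word-level majority vote, the function names, and the 0.9 acceptance threshold below are all illustrative assumptions, not the authors' method:

```python
from collections import Counter
from difflib import SequenceMatcher

def consensus_transcript(hypotheses):
    """Majority-vote each word position across equal-length ASR hypotheses.

    Assumes the hypotheses have the same word count; real pipelines would
    align them first.
    """
    words = []
    for position in zip(*[h.split() for h in hypotheses]):
        words.append(Counter(position).most_common(1)[0][0])
    return " ".join(words)

def matches_ground_truth(hypothesis, ground_truth, threshold=0.9):
    """Accept a chunk when the hypothesis is close enough to the reference."""
    ratio = SequenceMatcher(None, hypothesis.split(), ground_truth.split()).ratio()
    return ratio >= threshold

# Three hypothetical ASR outputs for the same audio chunk.
hyps = [
    "the cat sat on the mat",
    "the cat sad on the mat",
    "the cat sat on a mat",
]
consensus = consensus_transcript(hyps)   # "the cat sat on the mat"
accepted = matches_ground_truth(consensus, "the cat sat on the mat")
```

Voting across independent recognizers tends to cancel out uncorrelated errors, which is one plausible reason the paper combines several open-source Persian ASR models rather than relying on a single one.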
Stats
The ManaTTS dataset comprises approximately 86 hours and 24 minutes of processed and transcribed audio. The dataset contains 64,834 accepted audio-text chunks, with an average of 11 words per chunk. The dataset covers 24,113 unique words.
Quotes
"ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language."

"The dataset can be easily extended thanks to the monthly growing Nasl-e-Mana magazine and the fully open pipeline."

"The samples synthesized by this model exhibited remarkable naturalness, comparing favorably to both the utterances generated from gold speech spectrograms and natural speech waveforms."

Deeper Inquiries

How can the dataset be further extended and improved to cover a wider range of speakers, accents, and domains?

To extend and improve the ManaTTS dataset, several strategies can be employed.

First, incorporating a diverse range of speakers is essential. This can be achieved by recruiting multiple speakers from different regions of Iran and other Persian-speaking communities, ensuring representation of various accents and dialects. This diversity will enhance the dataset's applicability across different user demographics and improve the TTS model's ability to generalize across various speech patterns.

Second, expanding the dataset to include a broader range of domains is crucial. Currently, the dataset primarily draws from the Nasl-e-Mana magazine, which may limit the vocabulary and context. By integrating audio from various sources such as news articles, literature, educational materials, and conversational speech, the dataset can cover a wider array of topics and terminologies. This will not only enrich the vocabulary but also improve the model's performance in different contexts.

Third, continuous updates to the dataset can be facilitated by leveraging the monthly publication of the Nasl-e-Mana magazine. Automating the data collection process through web scraping and integrating new audio files regularly can help maintain the dataset's relevance and size. Additionally, implementing a feedback mechanism where users can contribute audio samples or corrections can further enhance the dataset's quality and diversity.

Finally, employing advanced data augmentation techniques can help simulate variations in speech, such as changes in pitch, speed, and background noise, thereby enriching the dataset without the need for extensive new recordings.
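The augmentation techniques mentioned above (speed changes and added background noise) can be sketched on raw sample lists with the standard library alone. This is a minimal illustration, not part of the ManaTTS pipeline; real projects would use dedicated audio libraries:

```python
import random

def change_speed(samples, factor):
    """Resample by linear interpolation; factor > 1 speeds up (fewer samples)."""
    n = max(1, int(len(samples) / factor))
    out = []
    for i in range(n):
        pos = i * factor
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        # Blend the two nearest original samples.
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

def add_noise(samples, scale=0.01, seed=0):
    """Add low-amplitude Gaussian noise to simulate background conditions."""
    rng = random.Random(seed)
    return [s + rng.gauss(0.0, scale) for s in samples]

tone = [(i % 20) / 20.0 for i in range(100)]  # toy sawtooth waveform
fast = change_speed(tone, 2.0)                # half as many samples
noisy = add_noise(tone)                       # same length, perturbed values
```

Applying several such transforms with randomized parameters multiplies the effective size of a recorded corpus, which is especially valuable when new studio recordings are expensive to obtain.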

What are the potential challenges and ethical considerations in deploying a high-quality open-source Persian TTS system for accessibility and other applications?

Deploying a high-quality open-source Persian TTS system presents several challenges and ethical considerations. One significant challenge is ensuring the accuracy and reliability of the synthesized speech. Variability in pronunciation, intonation, and emotional expression can affect the naturalness of the output, particularly for users with specific accessibility needs, such as those with visual impairments. Continuous evaluation and improvement of the TTS model are necessary to address these concerns.

Ethically, the use of a TTS system raises issues related to voice impersonation and privacy. The potential for misuse, such as creating deceptive audio that mimics a person's voice, necessitates the implementation of safeguards to prevent malicious applications. Anonymization techniques should be considered to protect the identities of the speakers in the dataset, ensuring that their voices cannot be easily replicated without consent.

Moreover, the accessibility of the TTS system must be balanced with the need for responsible usage. Clear guidelines should be established regarding the ethical use of the technology, particularly in sensitive contexts such as education, healthcare, and public services. Ensuring that the TTS system is used to empower users rather than exploit them is paramount.

Finally, there is a need for ongoing community engagement to address the evolving needs of users. Feedback from the visually impaired community and other stakeholders can guide the development of features that enhance usability and accessibility, ensuring that the TTS system serves its intended purpose effectively.

How can the forced alignment technique developed in this work be generalized and applied to create speech datasets for other low-resource languages?

The forced alignment technique developed in this work can be generalized and applied to create speech datasets for other low-resource languages by adapting its core principles to the specific linguistic characteristics and phonetic structures of those languages. The first step involves developing a robust transcription module that can handle the unique phonemes and intonational patterns of the target language. This may require training or fine-tuning existing ASR models on language-specific datasets to improve their accuracy.

Next, the alignment process can be tailored to accommodate the linguistic features of the new language. For instance, the start-end alignment and forced alignment methods can be adjusted to account for variations in speech patterns, such as the presence of compound words or language-specific syntactic structures. Utilizing a combination of interval and gapped search methods, as demonstrated in the ManaTTS project, can enhance the flexibility of the alignment process, allowing it to handle discrepancies between spoken and written forms effectively.

Furthermore, collaboration with native speakers and linguists can provide valuable insights into the nuances of the language, ensuring that the alignment process is culturally and contextually appropriate. This collaboration can also facilitate the collection of high-quality audio samples that reflect the diversity of speakers within the language community.

Finally, the open-source nature of the tools and methodologies developed in this work allows for easy adaptation and sharing among researchers working on low-resource languages. By providing clear documentation and guidelines, the forced alignment technique can be disseminated widely, encouraging its application in various linguistic contexts and contributing to the development of high-quality speech datasets across different languages.
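The interval-search idea above can be sketched language-agnostically: slide a window over the reference text and keep the span most similar to the chunk's hypothesis transcript, tolerating small length differences between spoken and written forms. The function name, the window-size slack of ±2 words, and the 0.8 acceptance ratio are illustrative choices, not values from the paper:

```python
from difflib import SequenceMatcher

def locate_chunk(hypothesis_words, reference_words, min_ratio=0.8):
    """Slide a window over the reference text and return the best-matching span.

    Returns (start, end, ratio) of the reference span most similar to the
    hypothesis, or None when nothing clears min_ratio.
    """
    n = len(hypothesis_words)
    best = None
    # Allow the spoken chunk to be slightly shorter or longer than the text.
    for size in range(max(1, n - 2), n + 3):
        for start in range(0, len(reference_words) - size + 1):
            window = reference_words[start:start + size]
            ratio = SequenceMatcher(None, hypothesis_words, window).ratio()
            if best is None or ratio > best[2]:
                best = (start, start + size, ratio)
    if best and best[2] >= min_ratio:
        return best
    return None

ref = "in the beginning the magazine covered news for the blind community".split()
hyp = "magazine covered news for the blind".split()
span = locate_chunk(hyp, ref)  # (4, 10, 1.0): an exact six-word match
```

Because the matcher operates on word sequences rather than on any particular phoneme inventory, the same scaffolding transfers to other languages once a reasonably accurate ASR front end exists for them.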