Core Concepts
This work introduces ManaTTS, the largest publicly available single-speaker Persian text-to-speech dataset, along with a comprehensive open-source processing pipeline for creating high-quality speech datasets for low-resource languages.
Abstract
This paper presents the ManaTTS dataset, which is the largest publicly accessible single-speaker Persian text-to-speech corpus, comprising approximately 86 hours of audio. The dataset was created by crawling and processing content from the Nasl-e-Mana magazine, a publication focused on the blind community.
The key highlights of this work include:
The dataset is distributed under the open CC0 1.0 license, enabling free educational and commercial use.
The audio files have a sampling rate of 44.1 kHz and were recorded in a silent environment with minimal background noise.
The dataset covers a diverse range of topics and includes 24,113 unique words.
The authors developed a fully transparent, MIT-licensed processing pipeline that includes novel tools for sentence tokenization, audio segmentation, and a custom forced alignment method designed for low-resource languages.
The forced alignment tool utilizes multiple open-source Persian automatic speech recognition (ASR) models to generate reliable hypothesis transcripts, which are then matched to the ground truth text.
Experiments demonstrate the effectiveness of the ManaTTS dataset: a Tacotron2-based TTS model trained on it achieved a Mean Opinion Score (MOS) of 3.76, which is 0.10 below the 3.86 obtained for natural speech spectrograms and 0.25 below the 4.01 obtained for natural waveforms.
The authors also released the VirgoolInformal dataset, a smaller transcribed speech corpus used to evaluate and prioritize the ASR models employed in the forced alignment process.
The entire processing pipeline, including data crawling scripts and model training code, is made publicly available, ensuring the reproducibility of the results.
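The idea behind the forced alignment highlight above (match ASR hypothesis transcripts against the ground truth text) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names `best_span` and `align_chunk`, the character-level similarity from Python's standard `difflib`, the search window `max_extra`, and the 0.8 acceptance threshold are all assumptions chosen for illustration. The sketch assumes each audio chunk has one hypothesis per ASR model, ordered by model priority, and that chunks are aligned left to right through the ground truth text.

```python
from difflib import SequenceMatcher


def best_span(hypothesis, truth_words, start, max_extra=3):
    """Find the ground-truth span beginning at `start` that is most similar
    to `hypothesis`. Returns (similarity, end_index). Hypothetical helper."""
    hyp_len = len(hypothesis.split())
    best_score, best_end = 0.0, start
    # Try span lengths close to the hypothesis length, since ASR output may
    # merge, split, or drop a few words.
    shortest = max(1, hyp_len - max_extra)
    longest = hyp_len + max_extra
    for length in range(shortest, longest + 1):
        end = start + length
        if end > len(truth_words):
            break
        candidate = " ".join(truth_words[start:end])
        score = SequenceMatcher(None, hypothesis, candidate).ratio()
        if score > best_score:
            best_score, best_end = score, end
    return best_score, best_end


def align_chunk(hypotheses, truth_words, start, threshold=0.8):
    """Try hypotheses in priority order and accept the first one whose best
    matching span reaches `threshold` (an assumed value).
    Returns (matched_text, next_start), or (None, start) if all are rejected."""
    for hypothesis in hypotheses:
        score, end = best_span(hypothesis, truth_words, start)
        if score >= threshold:
            return " ".join(truth_words[start:end]), end
    return None, start


# Usage with a toy English example (the real data is Persian).
truth = "the quick brown fox jumps over the lazy dog".split()
text1, pos = align_chunk(["the quick brown fox"], truth, 0)
text2, pos2 = align_chunk(["xxxx yyyy", "jumps over the lazy dog"], truth, pos)
rejected, pos3 = align_chunk(["zzzz"], truth, 0)
```

In this sketch the accepted transcript is always taken from the ground truth text rather than from the ASR output, so recognition errors do not leak into the dataset; chunks for which no hypothesis clears the threshold are simply rejected.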
Stats
The ManaTTS dataset comprises approximately 86 hours and 24 minutes of processed and transcribed audio.
The dataset contains 64,834 accepted audio-text chunks, with an average of 11 words per chunk.
The dataset covers 24,113 unique words.
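As a rough sanity check, the figures above can be combined to estimate the average chunk duration and the total word count. The inputs are the rounded values reported in this section ("approximately 86 hours and 24 minutes", "an average of 11 words per chunk"), so the derived numbers are estimates only and are not reported by the authors.

```python
# Figures taken from the Stats section (both duration and words per chunk are rounded).
total_seconds = 86 * 3600 + 24 * 60   # 86 h 24 min expressed in seconds
num_chunks = 64_834                   # accepted audio-text chunks
avg_words_per_chunk = 11

# Derived estimates.
avg_chunk_seconds = total_seconds / num_chunks
approx_total_words = num_chunks * avg_words_per_chunk
approx_words_per_second = approx_total_words / total_seconds

print(f"average chunk duration: {avg_chunk_seconds:.2f} s")
print(f"approximate total words: {approx_total_words:,}")
print(f"approximate speaking rate: {approx_words_per_second:.2f} words/s")
```

Under these assumptions the average chunk lasts a little under 5 seconds, and the corpus contains roughly 713 thousand running words, of which 24,113 are unique.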
Quotes
"ManaTTS, the most extensive publicly accessible single-speaker Persian corpus, and a comprehensive framework for collecting transcribed speech datasets for the Persian language."
"The dataset can be easily extended thanks to the monthly growing Nasl-e-Mana magazine and the fully open pipeline."
"The samples synthesized by this model exhibited remarkable naturalness, comparing favorably to both the utterances generated from gold speech spectrograms and natural speech waveforms."