Sign In

ÌròyìnSpeech: Yorùbá Speech Corpus for TTS and ASR

Core Concepts
Increasing high-quality Yorùbá speech data for Text-to-Speech and Automatic Speech Recognition tasks.
1. Introduction Lack of voice-enabled applications in African languages. Efforts to build multilingual datasets for speech recognition. Focus on Yorùbá language with limited contemporary speech research. 2. The Yorùbá Language Description of the Yorùbá language and its characteristics. 3. The ÌròyìnSpeech Corpus Preparation of text sentences from news and fictional texts. Recording of text sentences for ASR and TTS. Post-production processes to ensure quality. 4. Speech Synthesis Experiments Training speech synthesis models with different approaches. Evaluation of models with subjective and objective metrics. Impact of training with diacritics on synthesized voices. 5. Automatic Speech Recognition Experiments Training baseline ASR models and exploring different approaches. Comparison of end-to-end Conformer model and finetuned wav2vec 2.0. 6. Conclusion Presentation of the Yorùbá speech dataset for research. Findings on speech synthesis and ASR experiments. 7. Ethics Statement Consent of volunteers and privacy considerations. 8. Acknowledgements Recognition of support and funding for the project. 9. Bibliographical References List of references for further reading.
We curated about 23,000 text sentences. We created about 42 hours of speech data recorded by 80 volunteers. For ASR, we obtained a baseline word error rate (WER) of 23.8.
"We introduce ÌròyìnSpeech, a new corpus influenced by the desire to increase the amount of high quality, contemporary Yorùbá speech data." "Our TTS evaluation suggests that a high-fidelity, general domain, single-speaker Yorùbá voice is possible with as little as 5 hours of speech."

Key Insights Distilled From

by Tolulope Ogu... at 03-28-2024

Deeper Inquiries

How can the ÌròyìnSpeech dataset be utilized beyond TTS and ASR applications?

The ÌròyìnSpeech dataset can be utilized beyond Text-to-Speech (TTS) and Automatic Speech Recognition (ASR) applications in various ways. One significant application is in linguistic research and analysis. Researchers can use the dataset to study the phonetics, tonality, and linguistic nuances of the Yorùbá language. It can also be used for language learning and education purposes, providing authentic audio samples for students to practice listening and pronunciation. Additionally, the dataset can support the development of language-related technologies such as language identification, sentiment analysis, and emotion recognition in Yorùbá speech. Cultural preservation projects could leverage the dataset to create interactive language learning tools, digital archives of oral traditions, and multimedia experiences that showcase the richness of the Yorùbá language and culture.

What are the potential drawbacks of relying on crowd-sourced recordings for speech data?

While crowd-sourced recordings offer many benefits, such as scalability, diversity, and cost-effectiveness, there are potential drawbacks to consider. One major concern is the quality control of the recordings. Since crowd-sourced data relies on contributions from volunteers, there may be inconsistencies in recording quality, background noise, accent variations, and pronunciation accuracy. Ensuring the accuracy and reliability of the data can be challenging, especially when dealing with languages with specific tonal or phonetic requirements like Yorùbá. Another drawback is the potential lack of linguistic expertise among the volunteers, which could lead to mispronunciations, errors in intonation, or improper emphasis on certain words or phrases. Additionally, managing the ethical considerations of using crowd-sourced data, such as ensuring informed consent, protecting user privacy, and handling sensitive information, can pose challenges for researchers and project managers.

How can the findings of this research contribute to preserving and promoting the Yorùbá language and culture?

The findings of this research can contribute significantly to preserving and promoting the Yorùbá language and culture in several ways. By creating a high-quality speech corpus like ÌròyìnSpeech, researchers and developers can advance the development of language technologies that support Yorùbá, making it more accessible in digital environments. This can help bridge the gap between traditional oral communication and modern digital platforms, preserving the language for future generations. Furthermore, the research outcomes can raise awareness about the importance of linguistic diversity and cultural heritage, encouraging initiatives to protect and celebrate the Yorùbá language. The availability of resources like the ÌròyìnSpeech dataset can empower Yorùbá speakers to engage with their language in new ways, fostering a sense of pride and identity in their linguistic and cultural heritage.