The paper introduces VoxHakka, a text-to-speech (TTS) system for Taiwanese Hakka, a language with significant dialectal diversity. Taiwanese Hakka has six major dialects, and the language features complex phonology, including seven tones, diverse vowels, and syllable-final consonants.
To address the scarcity of publicly available Hakka speech corpora, the researchers employed a cost-effective approach using web scraping and automatic speech recognition (ASR)-based data cleaning techniques. This process resulted in a high-quality, multi-speaker, multi-dialect dataset suitable for TTS training.
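The ASR-based cleaning step can be illustrated with a minimal sketch. It assumes each scraped clip comes with a transcript and has also been transcribed by an ASR model, and it keeps only the clips where the two texts agree closely, measured by character error rate (CER). The `Utterance` type, the `clean` function, and the `max_cer` threshold of 0.2 are hypothetical choices for illustration; the summary does not state the filtering criterion the authors actually used.

```python
from dataclasses import dataclass


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, computed over characters."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate of hyp against ref (edit distance / len(ref))."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


@dataclass
class Utterance:
    # Hypothetical record layout, not taken from the paper.
    audio_path: str
    scraped_text: str  # transcript obtained with the scraped audio
    asr_text: str      # transcript produced by an ASR model


def clean(utterances, max_cer=0.2):
    """Keep utterances whose ASR output is close to the scraped transcript.

    max_cer is an assumed threshold chosen only for this example.
    """
    return [u for u in utterances if cer(u.scraped_text, u.asr_text) <= max_cer]
```

A clip whose ASR output diverges strongly from its scraped transcript is likely misaligned, noisy, or mislabeled, so dropping it trades corpus size for transcript reliability.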
The VoxHakka system is built upon the YourTTS framework, which enables high-quality and efficient speech synthesis while supporting zero-shot capabilities for unseen speakers and languages. The model is trained with dialect-specific data, allowing for the generation of speaker-aware Hakka speech.
Subjective listening tests using Comparative Mean Opinion Scores (CMOS) show that VoxHakka significantly outperforms existing publicly available Hakka TTS systems in pronunciation accuracy, tone correctness, and overall naturalness. The work advances Hakka language technology and offers a resource for language preservation and revitalization efforts.
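For readers unfamiliar with CMOS, the sketch below shows one common way such a score is aggregated. In a comparative test, each listener hears two systems and gives a signed rating, where positive values favor the system under test. The code averages those ratings and reports a normal-approximation 95% confidence half-width. The rating scale, the function name `cmos`, and the use of a 1.96 multiplier are assumptions for illustration, not details taken from the paper.

```python
import math
import statistics


def cmos(scores):
    """Return (mean rating, 95% confidence half-width) for comparative scores.

    scores: signed listener ratings, e.g. on an assumed -3..+3 scale,
    where positive values mean the system under test was preferred.
    """
    n = len(scores)
    mean = statistics.mean(scores)
    if n < 2:
        # A single rating gives no estimate of spread.
        return mean, 0.0
    # Normal approximation: 1.96 * sample standard deviation / sqrt(n).
    half_width = 1.96 * statistics.stdev(scores) / math.sqrt(n)
    return mean, half_width
```

A mean clearly above zero, with a half-width that does not reach zero, is the usual reading of one system being preferred over the other.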
Source: arxiv.org