The FeruzaSpeech corpus contributes a new read-speech resource to Uzbek speech technology. It provides 60 hours of high-quality, single-channel, 16-bit audio sampled at 16 kHz, with transcripts in both the Cyrillic and Latin alphabets. The corpus consists of excerpts from a classic Uzbek novel, "Choliqushi," and BBC Uzbek news articles, read by a native female speaker from Tashkent, Uzbekistan.
The dataset is divided into training, development, and test sets: the training set includes both book and news excerpts, while the development and test sets contain only news excerpts. Its audio segments are longer than those in other Uzbek speech corpora, averaging 16.39 seconds, with each segment containing one to two full sentences.
Experiments conducted using the FeruzaSpeech corpus, in combination with the existing Common Voice Uzbek Dataset and Uzbek Speech Corpus, have shown significant improvements in word error rates (WERs) for automatic speech recognition (ASR) models. The best WER on the Common Voice test set was 11.17%, the best WER on the FeruzaSpeech test set was 4.05%, and the best WER on the Uzbek Speech Corpus test set was 11.67%.
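For readers unfamiliar with the metric, the following is a minimal sketch of how a word error rate can be computed: the word-level edit distance (substitutions, deletions, and insertions) between a reference transcript and a model's hypothesis, divided by the number of reference words. The function name `word_error_rate` and the whitespace tokenization are assumptions made for illustration; the source does not describe the scoring tool or text normalization used in the experiments.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words.

    Illustrative sketch only: tokenization is plain whitespace splitting,
    which may differ from the normalization used in the experiments.
    """
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]  # deleting all i reference words
        for j, hyp_word in enumerate(hyp_words, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr

    return prev[-1] / len(ref_words)


if __name__ == "__main__":
    # One substitution and one deletion against a four-word reference.
    print(word_error_rate("a b c d", "a x c"))
```

Under this definition, a WER of 4.05% corresponds to roughly four word errors per hundred reference words. Because insertions count as errors, the value can exceed 100% for very poor hypotheses.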
The availability of dual-alphabet transcripts in FeruzaSpeech is a unique feature, as it addresses the challenge of accurately converting between the Cyrillic and Latin alphabets used in Uzbekistan. This corpus complements the existing Uzbek speech datasets and is expected to contribute to the advancement of speech recognition and synthesis technologies for the Uzbek language.