Core Concepts
This paper presents the first dataset for automatic speech recognition (ASR) in the endangered Kichwa language, containing 4 hours of audio with transcriptions, Spanish translations, and morphosyntactic annotations in Universal Dependencies format.
Abstract
The paper introduces the Killkan dataset, the first resource for automatic speech recognition (ASR) in the Kichwa language, an endangered indigenous language of Ecuador. The dataset contains approximately 4 hours of audio with transcriptions, Spanish translations, and morphosyntactic annotations in the Universal Dependencies format.
Key highlights:
Kichwa is an extremely low-resource language with no prior datasets available for natural language processing.
The audio data was retrieved from a publicly available radio program in Kichwa, and the transcriptions were manually corrected and aligned with the audio.
The dataset includes Spanish-Kichwa code-switching, which is common in spoken Kichwa, and the annotations capture this phenomenon.
The morphosyntactic annotations extend the Universal Dependencies guidelines to represent Kichwa-specific features like topic, focus, and switch-reference.
Experiments show that the dataset enables the first reliable ASR system for Kichwa: a fine-tuned wav2vec2 model achieves a character error rate (CER) of 2.04%.
The dataset, ASR model, and the code used to develop them will be publicly available, contributing to resource building and applications for low-resource languages.
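The character error rate reported in the highlights is conventionally defined as the character-level edit distance between the model output and the reference transcription, divided by the number of reference characters. The following is a minimal sketch of that standard definition, not the authors' evaluation code. The function names and the toy strings are illustrative assumptions, and the paper may apply text normalization that is not shown here.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the processed prefix of ref and hyp[:j].
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(references: Sequence[str], hypotheses: Sequence[str]) -> float:
    """Corpus-level CER: total character edits / total reference characters."""
    if len(references) != len(hypotheses):
        raise ValueError("references and hypotheses must have the same length")
    total_chars = sum(len(r) for r in references)
    if total_chars == 0:
        raise ValueError("references contain no characters")
    total_edits = sum(edit_distance(r, h) for r, h in zip(references, hypotheses))
    return total_edits / total_chars


# Toy strings (not Kichwa data), used only to exercise the functions.
toy_refs = ["kitten", "abc"]
toy_hyps = ["sitting", "abc"]
toy_cer = cer(toy_refs, toy_hyps)
```

Under this definition, a CER of 2.04% corresponds to roughly two character edits per hundred reference characters.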
Stats
The dataset contains approximately 4 hours of audio with 26,544 tokens.
The average length of a token is 6.12 characters.
The average sentence length is 6.76 tokens.
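Statistics of this kind can be computed directly from tokenized sentences. The sketch below assumes each sentence is already available as a list of token strings. The function name `corpus_stats` and the toy input are hypothetical, and the paper's own tokenization (for example, how punctuation is counted) may differ.

```python
from typing import Dict, List


def corpus_stats(sentences: List[List[str]]) -> Dict[str, float]:
    """Token count, mean token length (characters), mean sentence length (tokens)."""
    tokens = [tok for sent in sentences for tok in sent]
    if not tokens:
        raise ValueError("corpus contains no tokens")
    return {
        "num_tokens": len(tokens),
        "avg_token_length": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_length": len(tokens) / len(sentences),
    }


# Toy input (placeholder tokens, not Kichwa data).
toy_sentences = [["aa", "bbbb"], ["cccccc"]]
toy_stats = corpus_stats(toy_sentences)
```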
Quotes
"Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing."
"Our dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community."