
The Killkan Dataset: Automatic Speech Recognition for the Endangered Kichwa Language with Morphosyntactic Annotations


Core Concepts
This paper presents the first dataset for automatic speech recognition (ASR) in the endangered Kichwa language, containing 4 hours of audio with transcriptions, Spanish translations, and morphosyntactic annotations in Universal Dependencies format.
Abstract
The paper introduces the Killkan dataset, the first resource for automatic speech recognition (ASR) in the Kichwa language, an endangered indigenous language of Ecuador. The dataset contains approximately 4 hours of audio with transcriptions, Spanish translations, and morphosyntactic annotations in the Universal Dependencies format.

Key highlights:
Kichwa is an extremely low-resource language with no prior datasets available for natural language processing.
The audio data was retrieved from a publicly available radio program in Kichwa, and the transcriptions were manually corrected and aligned with the audio.
The dataset includes Spanish-Kichwa code-switching, which is common in spoken Kichwa, and the annotations capture this phenomenon.
The morphosyntactic annotations extend the Universal Dependencies guidelines to represent Kichwa-specific features like topic, focus, and switch-reference.
Experiments show that the dataset enables the development of the first reliable ASR system for Kichwa, achieving a character error rate of 2.04% when fine-tuning the wav2vec2 model.
The dataset, ASR model, and the code used to develop them will be publicly available, contributing to resource building and applications for low-resource languages.
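For readers unfamiliar with the metric, character error rate (CER) is conventionally the character-level edit distance between the system output and the reference transcription, divided by the number of reference characters. The sketch below assumes that standard definition; the paper's own evaluation script is not reproduced here, and the function names `edit_distance` and `character_error_rate` are hypothetical. The strings are placeholders, not Kichwa data.

```python
from typing import Sequence


def edit_distance(ref: str, hyp: str) -> int:
    """Levenshtein distance between two strings, counted in characters."""
    # prev[j] holds the distance between the first i-1 chars of ref
    # and the first j chars of hyp.
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion from ref
                curr[j - 1] + 1,     # insertion into ref
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


def character_error_rate(refs: Sequence[str], hyps: Sequence[str]) -> float:
    """Corpus-level CER: total edits divided by total reference characters."""
    if len(refs) != len(hyps):
        raise ValueError("refs and hyps must have the same length")
    total_chars = sum(len(r) for r in refs)
    if total_chars == 0:
        raise ValueError("reference text is empty")
    total_edits = sum(edit_distance(r, h) for r, h in zip(refs, hyps))
    return total_edits / total_chars


# Placeholder strings: one substitution in four reference characters.
print(character_error_rate(["abcd"], ["abxd"]))
```

Under this definition, a CER of 2.04% corresponds to roughly two character edits per hundred reference characters.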
Stats
The dataset contains approximately 4 hours of audio with 26,544 tokens. The average length of a token is 6.12 characters. The average sentence length is 6.76 tokens.
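Figures like these can be derived directly from the transcriptions. The following is a minimal sketch, assuming one sentence per string and simple whitespace tokenization; the paper's own tokenization (for example, the one used for the Universal Dependencies annotations) may differ, and `corpus_stats` is a hypothetical helper. The input strings are placeholders, not Kichwa data.

```python
from typing import Dict, List


def corpus_stats(sentences: List[str]) -> Dict[str, float]:
    """Token count, mean token length (chars), mean sentence length (tokens)."""
    # Assumption: tokens are separated by whitespace.
    tokenized = [s.split() for s in sentences]
    tokens = [tok for sent in tokenized for tok in sent]
    if not tokens:
        raise ValueError("no tokens found")
    return {
        "sentences": len(tokenized),
        "tokens": len(tokens),
        "avg_token_len": sum(len(t) for t in tokens) / len(tokens),
        "avg_sentence_len": len(tokens) / len(tokenized),
    }


stats = corpus_stats(["aa bbb", "c dd eee ffff"])
print(stats)
```

As a rough consistency check on the reported numbers, 26,544 tokens at 6.76 tokens per sentence implies on the order of 3,900 sentences; that count is an inference from the two figures, not a number stated in the summary.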
Quotes
"Kichwa is an extremely low-resource endangered language, and there have been no resources before Killkan for Kichwa to be incorporated in applications of natural language processing."
"Our dataset, the ASR model, and the code used to develop them will be publicly available. Thus, our study positively showcases resource building and its applications for low-resource languages and their community."

Deeper Inquiries

How can the Killkan dataset and ASR model be leveraged to support language revitalization efforts for Kichwa?

The Killkan dataset and ASR model can support Kichwa revitalization by supplying resources for documenting and teaching the language. First, the dataset pairs Kichwa audio with transcriptions, Spanish translations, and morphosyntactic annotations, which lets researchers and linguists study the language's morphology and syntax. That analysis can feed into educational materials, language learning tools, and curricula for Kichwa speakers and learners. Second, the ASR model can transcribe spoken Kichwa automatically, which makes it easier to document oral traditions, storytelling, and cultural practices. By digitizing and preserving Kichwa language data, the dataset and ASR model contribute to the long-term sustainability and visibility of Kichwa in digital spaces.

What are the potential challenges in deploying the Kichwa ASR system in real-world applications, and how can they be addressed?

Deploying the Kichwa ASR system in real-world applications raises several challenges. One is the variability of spoken Kichwa across dialects and accents, which can reduce recognition accuracy. Fine-tuning the model on more diverse dialectal data could improve its robustness to this variation. Another is code-switching between Spanish and Kichwa, which can lead to transcription errors. Addressing it would likely require training on a larger code-switched dataset and adding language identification to distinguish the two languages. Finally, making the technology accessible to Kichwa-speaking communities, including users with limited digital literacy, calls for user-friendly interfaces, multilingual support, and community engagement to encourage adoption.

Given the high rate of Spanish-Kichwa code-switching, how can future research explore the interplay between the two languages and its implications for language technology development?

Future research can examine Spanish-Kichwa code-switching both to improve language technology and to better understand bilingualism in indigenous communities. One avenue is developing code-switching detection models that automatically identify and analyze code-switched segments in speech data. Understanding the patterns and linguistic features of code-switching can help ASR models transcribe mixed-language utterances more accurately. In addition, studying the sociolinguistic factors that influence code-switching can provide insights into language attitudes, identity, and language maintenance strategies within bilingual communities. This research can inform the design of culturally sensitive language technologies that respect and preserve indigenous languages like Kichwa. Treating code-switching as a regular linguistic phenomenon, rather than as noise, would let language technology better serve multilingual speakers.