
Curated Datasets and Neural Models for Machine Translation of Informal Mayan and Spanish Vernaculars


Core Concepts
This paper presents MayanV, a set of curated parallel corpora between various Mayan languages and Spanish, focusing on informal, day-to-day, and non-domain-specific language usage. The authors develop and evaluate neural machine translation models trained on these datasets, demonstrating significant improvements over baseline models that do not include the MayanV data.
Abstract
The paper provides an overview of the Mayan language family, highlighting its ancient history, cultural significance, and current status of underrepresentation in digital resources and global exposure. To address this gap, the authors have developed MayanV, a set of parallel corpora between Mayan languages and Spanish, focusing on informal, familial, and non-domain-specific language usage. The MayanV dataset was curated by manually extracting and cleaning resources from various online sources, primarily published by the Guatemalan Academy of Mayan Languages (ALMG). The authors performed a dialectometric analysis to characterize the Spanish dialect and register found in MayanV, observing considerable divergence from the more widespread written standard of Spanish. The authors then trained and evaluated bilingual and multilingual neural machine translation (NMT) models on the MayanV dataset, comparing their performance to baseline models trained on other available resources, such as the Bible and the Jehovah's Witnesses website. The results show that the models trained with the MayanV data significantly outperform the baselines, highlighting the importance of using resources that accurately reflect the common, real-life language usage of Mayan speakers. The authors conclude that the development of language technology tools for Mayan languages, such as the NMT models presented in this work, is crucial for promoting linguistic diversity and supporting the preservation and revitalization of these indigenous languages.
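The summary does not say how the multilingual models were configured. One widely used way to train a single model on several language pairs is to prepend a target-language tag to each source sentence, so the sketch below illustrates that idea only; it is not the authors' setup. The tag format `<2xxx>`, the function names, and the placeholder sentences are all assumptions made for illustration.

```python
# Minimal sketch of target-language tagging for multilingual NMT data.
# Assumption: tags look like "<2spa>"; the paper's actual setup is not stated here.

def tag_for_target(source_sentence: str, target_lang: str) -> str:
    """Prepend a hypothetical target-language tag to a source sentence."""
    return f"<2{target_lang}> {source_sentence}"


def build_multilingual_corpus(corpora):
    """Merge bilingual corpora into one tagged training set.

    corpora maps (src_lang, tgt_lang) to a list of (src, tgt) sentence pairs.
    Each pair is emitted in both translation directions.
    """
    examples = []
    for (src_lang, tgt_lang), pairs in corpora.items():
        for src, tgt in pairs:
            examples.append((tag_for_target(src, tgt_lang), tgt))
            examples.append((tag_for_target(tgt, src_lang), src))
    return examples


# Placeholder text only; "tzh" and "spa" are ISO 639-3 codes for Tzeltal and Spanish.
corpora = {("tzh", "spa"): [("tzeltal sentence", "spanish sentence")]}
training_examples = build_multilingual_corpus(corpora)
```

Emitting both directions from every pair doubles the training examples without new data, which is one reason this kind of setup is attractive for low-resource pairs.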
Stats
The Mayan languages have an ancient history, millions of speakers, and immense cultural value, but remain severely underrepresented in digital resources and global exposure.
Only around half the population of ethnic Mayas are Mayan speakers, and the languages are often associated with backwardness, ignorance, and poverty.
Mayan languages exhibit a high degree of dialectal variation, and code-switching with Spanish and other Mayan languages is common in areas of high contact.
The MayanV dataset includes 15 Mayan languages, with the largest corpora being for Tzeltal (103,309 words) and Q'eqchi' (18,529 words).
The Spanish found in MayanV is characterized by considerable dialectal divergence from the more widespread written standard, as evidenced by the dialectometric analysis.
Quotes
"Mayan languages, despite the total number of speakers, are considered to be somewhat in decline: according to Richards and Macario (2003), only around half the population of ethnic Mayas are Mayan speakers, and the languages are associated in many social contexts to backwardness, ignorance and poverty (England, 2003)."
"Because of such scarcity of parallel resources for any Mayan language, especially those with just a few thousand, or even a few hundred, speakers, we use the parallel corpora we have built to train and evaluate a number of bilingual and multilingual NMT systems; in particular, multilingual systems have proven effective when dealing with low-resource and underrepresented languages (Lakew et al., 2018)."

Deeper Inquiries

What strategies could be employed to further expand the MayanV dataset and increase its coverage of Mayan languages and language varieties?

Several strategies could expand the MayanV dataset and increase its coverage of Mayan languages and language varieties:
Collaboration with Indigenous Communities: Engage directly with Mayan communities to collect oral histories, traditional stories, and everyday conversations in various Mayan languages. This can help capture a wider range of language varieties and dialects.
Crowdsourcing and Citizen Science: Use crowdsourcing platforms or citizen science initiatives to gather linguistic data from native speakers. This approach can help collect a large volume of data from diverse sources.
Partnerships with Linguists and Anthropologists: Collaborate with linguists and anthropologists who specialize in Mayan languages to access existing resources, fieldwork data, and documentation. This can provide valuable insights and materials for dataset expansion.
Integration of Speech Data: Collect speech data so that the dataset includes spoken language samples. This can improve the dataset's coverage of oral language use and pronunciation variation.
Inclusion of Rarely Documented Varieties: Focus on documenting and including data from rarely documented or endangered Mayan language varieties to ensure their preservation and representation in the dataset.
Continuous Data Curation: Regularly update and curate the dataset by adding new content, verifying data quality, and addressing any gaps or inconsistencies in the existing data.
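The curation step listed last can be sketched as a small cleaning pass over sentence pairs. This is a minimal sketch, not the procedure used to build MayanV: the function names, the length-ratio threshold of 3.0, and the choice of filters (Unicode normalization, empty-line removal, length-ratio check, duplicate removal) are all assumptions made for illustration.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def curate(pairs, max_ratio=3.0):
    """Return cleaned (source, target) pairs.

    Drops pairs with an empty side, pairs whose word counts differ by more
    than max_ratio (a hypothetical threshold), and case-insensitive duplicates.
    """
    seen = set()
    kept = []
    for src, tgt in pairs:
        src, tgt = normalize(src), normalize(tgt)
        if not src or not tgt:
            continue
        n_src, n_tgt = len(src.split()), len(tgt.split())
        if max(n_src, n_tgt) / min(n_src, n_tgt) > max_ratio:
            continue
        key = (src.casefold(), tgt.casefold())
        if key in seen:
            continue
        seen.add(key)
        kept.append((src, tgt))
    return kept


# Placeholder tokens stand in for real sentences.
raw_pairs = [
    ("a b c", "x y z"),
    ("a  b c ", "X y z"),   # duplicate after normalization and case folding
    ("", "x"),              # empty source side
    ("a", "x y z w v"),     # length ratio 5 exceeds the threshold
    ("d e", "u v w"),
]
clean_pairs = curate(raw_pairs)
```

A fixed length-ratio threshold is a blunt tool for morphologically rich languages, where one word can correspond to several Spanish words, so any real threshold would need to be tuned per language pair.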

How can the insights from the dialectometric analysis of the Spanish in MayanV be leveraged to improve the performance of the NMT models, particularly in handling code-switching and dialectal variation?

The insights from the dialectometric analysis of the Spanish in MayanV could help the NMT models handle code-switching and dialectal variation in the following ways:
Customized Training Data: Use the dialectometric analysis results to create training data subsets that target the dialectal variations identified. This can help the models learn to distinguish between Spanish dialects and adapt their translations accordingly.
Fine-tuning Models: Fine-tune the NMT models on the dialectally diverse data subsets to improve their handling of code-switching and dialectal nuances. This can help the models better capture the linguistic variation present in the data.
Augmented Training Strategies: Apply data augmentation techniques that introduce dialectal variations and code-switching scenarios during training. This can expose the models to a wider range of language patterns and improve their robustness.
Dialect-aware Evaluation Metrics: Develop evaluation metrics that take into account the specific dialectal characteristics identified in the analysis. This can give a more nuanced picture of the models' performance and guide further improvements.
Continuous Monitoring and Feedback: Continuously monitor the models' output and gather feedback from native speakers to refine the translations. This iterative process can help tune the models for dialectal variation and code-switching.
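One simple reading of dialect-aware evaluation is to report a score per dialect group instead of a single corpus-wide average, so that weak performance on one variety is not hidden. The sketch below assumes each test example already carries a group label (for instance, one derived from the dialectometric analysis). The token-overlap F1 used here is a deliberately simple stand-in, not the metric used in the paper, and all names and example data are hypothetical.

```python
from collections import Counter, defaultdict


def token_f1(hypothesis: str, reference: str) -> float:
    """Token-overlap F1 between two whitespace-tokenized strings (stand-in metric)."""
    hyp, ref = hypothesis.split(), reference.split()
    if not hyp or not ref:
        return 0.0
    # Counter intersection keeps the minimum count of each shared token.
    overlap = sum((Counter(hyp) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(hyp)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


def scores_by_group(examples):
    """Average the stand-in metric per group.

    examples is an iterable of (group_label, hypothesis, reference) triples.
    """
    buckets = defaultdict(list)
    for group, hypothesis, reference in examples:
        buckets[group].append(token_f1(hypothesis, reference))
    return {group: sum(vals) / len(vals) for group, vals in buckets.items()}


# Placeholder tokens and hypothetical group labels.
examples = [
    ("group_a", "a b c", "a b c"),
    ("group_a", "a b", "c d"),
    ("group_b", "a b", "a c"),
]
per_group = scores_by_group(examples)
```

In practice the stand-in metric would be replaced by whatever metric the evaluation already uses; the grouping logic stays the same.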

Given the cultural and historical significance of the Mayan languages, how can the development of language technology tools like the NMT models presented in this work be integrated with broader efforts to preserve and revitalize these indigenous languages?

The development of language technology tools like the NMT models presented in this work can be integrated with broader efforts to preserve and revitalize Mayan languages in the following ways:
Community Engagement: Involve Mayan communities in the development process to ensure that the tools meet their linguistic and cultural needs. This collaboration can foster a sense of ownership and empowerment among native speakers.
Educational Initiatives: Integrate the NMT models into educational programs focused on Mayan language revitalization. These tools can support language learning efforts and promote the use of Mayan languages in various contexts.
Cultural Documentation: Use the NMT models to translate and preserve traditional Mayan texts, songs, and cultural materials. This can contribute to the documentation and dissemination of cultural heritage.
Accessibility and Inclusivity: Ensure that the NMT tools are accessible to a wide range of users, including those with limited technological resources. This can promote inclusivity and reach a broader audience of Mayan language speakers.
Policy Advocacy: Advocate for policies that support the use of Mayan languages in official settings, education, and digital platforms. The NMT models can serve as evidence of the viability and importance of preserving these languages.
Long-term Sustainability: Establish mechanisms for the continuous development and maintenance of the NMT tools to ensure their long-term sustainability and relevance in supporting Mayan language preservation efforts.