
A Novel Parallel Dataset and Machine Translation System for the Tulu Language


Core Concepts
This study introduces the first parallel dataset for English-Tulu translation and develops a machine translation system for this low-resource language by leveraging resources from the related Kannada language.
Abstract
The authors present the first parallel dataset for English-Tulu translation, created by extending the FLORES-200 dataset with human translations into Tulu. The translations were obtained in collaboration with Jai Tulunad, a volunteer organization dedicated to preserving Tulu language and culture. The authors then develop an English-Tulu machine translation system using a transfer learning approach, leveraging resources available for Kannada, a related South Dravidian language, so that no parallel English-Tulu data is required for training. The key steps are:

1. Fine-tuning a pre-trained IndicBARTSS model to translate from Kannada to English, and using this model to back-translate the monolingual Tulu data (illustrated in the sketch below).
2. Training an English-Tulu model on the back-translated pairs, parallel English-Kannada data, and a denoising autoencoding objective.
3. Further fine-tuning the models on the parallel Kannada-Tulu data from the DravidianLangTech-2022 shared task.

The resulting English-Tulu model achieves a BLEU score of 35.41, significantly outperforming Google Translate, which scored 7.19 on the same test set. The authors note several limitations, however, including the absence of an adversarial training step and the relatively small size of the Tulu monolingual dataset.
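The page includes no code; as a rough illustration of the back-translation step, here is a minimal sketch using the publicly available IndicBARTSS checkpoint from Hugging Face. The fine-tuned Kannada-English checkpoint path and the generation settings are assumptions, not the authors' released artifacts; the language-tag conventions (<2kn>, <2en>) follow the public IndicBART model card. Because Tulu is written in the Kannada script, Tulu input is tagged as Kannada here.

```python
# Hedged sketch of the back-translation step: a Kannada->English model is
# applied to monolingual Tulu text (written in the Kannada script) to create
# synthetic English-Tulu training pairs. The fine-tuned checkpoint path is a
# placeholder; tag handling follows the public IndicBART model card.
from transformers import AutoTokenizer, MBartForConditionalGeneration

tokenizer = AutoTokenizer.from_pretrained(
    "ai4bharat/IndicBARTSS", do_lower_case=False, use_fast=False, keep_accents=True
)
# Placeholder: a Kannada->English fine-tuned IndicBARTSS checkpoint.
model = MBartForConditionalGeneration.from_pretrained("path/to/kn-en-checkpoint")

def back_translate(tulu_sentences):
    """Decode each Tulu sentence into English, yielding (English, Tulu) pairs."""
    pairs = []
    for src in tulu_sentences:
        # IndicBART expects the source sentence followed by </s> and a
        # source-language tag; Tulu is fed with the Kannada tag <2kn>.
        batch = tokenizer(src + " </s> <2kn>", add_special_tokens=False, return_tensors="pt")
        out = model.generate(
            batch.input_ids,
            num_beams=4,
            max_length=128,
            early_stopping=True,
            # Start decoding with the English target-language tag.
            decoder_start_token_id=tokenizer._convert_token_to_id_with_added_voc("<2en>"),
        )
        english = tokenizer.decode(out[0], skip_special_tokens=True)
        pairs.append((english, src))  # synthetic (EN, TCY) training pair
    return pairs
```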
Stats
Tulu has around 2.5 million speakers, predominantly in the southwestern region of India.
The Tulu Wikipedia contains 1,894 articles, from which the authors extracted a monolingual Tulu corpus of 40,000 sentences.
The DravidianLangTech-2022 shared task provided a parallel Kannada-Tulu dataset of 8,300 sentences.
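The page does not describe how the Wikipedia corpus was collected; purely as a hypothetical illustration, article text from the Tulu Wikipedia (tcy.wikipedia.org) could be pulled through the standard MediaWiki API along these lines. The endpoint and parameters are standard MediaWiki features; the authors' actual extraction and sentence-splitting pipeline may differ.

```python
# Hypothetical sketch: harvesting plain text from the Tulu Wikipedia via the
# standard MediaWiki API. This is NOT the authors' documented pipeline.
import requests

API = "https://tcy.wikipedia.org/w/api.php"

def iter_page_titles():
    """Yield every article title on tcy.wikipedia.org (about 1,894 pages)."""
    params = {"action": "query", "list": "allpages", "aplimit": 500, "format": "json"}
    while True:
        data = requests.get(API, params=params, timeout=30).json()
        for page in data["query"]["allpages"]:
            yield page["title"]
        if "continue" not in data:  # MediaWiki pagination token
            break
        params.update(data["continue"])

def fetch_plaintext(title):
    """Return the plain-text extract of a single article."""
    params = {
        "action": "query", "prop": "extracts", "explaintext": 1,
        "titles": title, "format": "json",
    }
    pages = requests.get(API, params=params, timeout=30).json()["query"]["pages"]
    return next(iter(pages.values())).get("extract", "")
```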
Quotes
"Tulu, classified within the South Dravidian linguistic family branch, is predominantly spoken by approximately 2.5 million individuals in southwestern India." "Without access to parallel EN–TCY data, we developed this system using a transfer learning (Zoph et al., 2016) to address translation challenges in this low-resource language." "Our English–Tulu system, trained without using parallel English–Tulu data, outperforms Google Translate by 19 BLEU points (in September 2023)."

Key Insights Distilled From

by Manu... at arxiv.org 03-29-2024

https://arxiv.org/pdf/2403.19142.pdf
A Tulu Resource for Machine Translation

Deeper Inquiries

How can the authors further improve the performance of their English-Tulu machine translation model, beyond the current BLEU score of 35.41?

To further enhance the performance of their English-Tulu machine translation model, the authors can consider several strategies:

Fine-tuning with More Parallel Data: Increasing the amount of parallel English-Tulu data gives the model more diverse examples to learn from and improves its handling of linguistic nuances.

Domain-Specific Training: Fine-tuning the model on domain-specific data, such as legal, medical, or technical texts, can optimize translation quality for those areas.

Model Architecture Optimization: Experimenting with different transformer architectures, such as larger models or variants tailored to low-resource languages, and tuning hyperparameters may yield better results.

Data Augmentation Techniques: Back-translation, synthetic data generation, and data filtering increase the diversity of the training data, helping the model generalize to a wider range of inputs.

Post-Editing and Human Evaluation: Human evaluation and post-editing of the translations provide feedback that can identify specific weaknesses and guide further refinement; automatic metrics such as BLEU can track progress between iterations (see the evaluation sketch below).
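To make the evaluation step concrete, here is a minimal sketch of a corpus-level BLEU computation with the sacreBLEU library, the kind of measurement behind the reported 35.41 vs. 7.19 comparison. The file names are placeholders, not artifacts released with the paper.

```python
# Minimal evaluation sketch using the sacrebleu library (a standard tool for
# corpus-level BLEU). File paths are hypothetical placeholders.
import sacrebleu

with open("flores200.devtest.tcy") as f:   # reference Tulu translations
    references = [line.strip() for line in f]
with open("system_output.tcy") as f:       # model hypotheses, one per line
    hypotheses = [line.strip() for line in f]

# sacrebleu expects a list of reference streams (here, a single reference).
bleu = sacrebleu.corpus_bleu(hypotheses, [references])
print(f"BLEU = {bleu.score:.2f}")
```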