Core Concepts
The authors developed the gaHealth corpus to address the scarcity of parallel health-domain data for low-resource languages, demonstrating a significant improvement in translation quality for models trained on it.
Abstract
The gaHealth corpus was created to improve Machine Translation for low-resource languages, with a focus on the health domain. Built from professionally translated documents and Covid-related data, it yielded up to a 40% BLEU score improvement over comparable models. Development involved extracting, cleaning, and aligning bilingual text files from multiple sources. The toolchain ensured high-quality corpora by normalizing characters, detecting languages, and aligning sentences accurately, and guidelines were established to streamline the conversion of PDF documents into a sentence-aligned corpus. The Transformer architecture, with optimized hyperparameters, was used to train English-Irish and Irish-English translation models for the health domain. Translation quality was evaluated with the automated metrics BLEU, TER, and ChrF, with the gaHealth models significantly outperforming other systems.
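The cleaning stage described above (character normalization, language detection, sentence pairing) can be illustrated with a minimal sketch. This is not the paper's actual toolchain: the function names, the fada-and-function-word heuristic standing in for a real language detector, and the naive 1:1 line pairing standing in for proper sentence alignment are all illustrative assumptions.

```python
import unicodedata

# Toy markers for Irish text: fada vowels and a few common function words.
# A real pipeline would use a trained language-identification model instead.
IRISH_MARKERS = set("áéíóúÁÉÍÓÚ")
IRISH_WORDS = {"agus", "ar", "na", "le", "go"}

def normalise(line: str) -> str:
    """Unicode-normalise to NFC and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", line).split())

def looks_irish(line: str) -> bool:
    """Crude heuristic check that a line is plausibly Irish."""
    words = set(line.lower().split())
    return bool(IRISH_MARKERS & set(line)) or bool(words & IRISH_WORDS)

def align_pairs(en_lines, ga_lines):
    """Keep only non-empty 1:1 pairs whose Irish side passes the heuristic."""
    pairs = []
    for en, ga in zip(en_lines, ga_lines):
        en, ga = normalise(en), normalise(ga)
        if en and ga and looks_irish(ga):
            pairs.append((en, ga))
    return pairs
```

For example, `align_pairs(["Good day"], ["Lá maith"])` keeps the pair, while a pair whose target side shows no Irish signal is dropped as a likely extraction error.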
Stats
Models trained on gaHealth showed a BLEU score improvement of up to 40%.
The ga2en model achieved a BLEU score of 57.6.
The en2ga* system reached a BLEU score of 37.6.
Quotes
"Developing smaller in-domain datasets can yield significant benefits overlooked by generic translation approaches."
"Our study outlines the process used in developing the corpus and empirically demonstrates the benefits of using an in-domain dataset for health data."
"The availability of large amounts of textual data is fundamental to NLP applications' success."