The gaHealth corpus was created to improve machine translation (MT) for low-resource languages, with a focus on health data. Built from professionally translated documents and Covid-related data, the corpus was used to train models that showed a 40% improvement in BLEU score over the other systems evaluated. Development involved extracting, cleaning, and aligning bilingual text files from a variety of sources. The toolchain supports corpus quality by normalizing characters, detecting languages, and aligning sentences. Guidelines were also established to streamline the conversion of PDF documents into a sentence-aligned corpus. Transformer models with optimized hyperparameters were trained for English-to-Irish and Irish-to-English translation in the health domain. Translation quality was evaluated with the automated metrics BLEU, TER, and ChrF, and the gaHealth models significantly outperformed the other systems.
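As a rough illustration of the cleaning stage described in the summary, the sketch below normalizes characters and filters candidate sentence pairs using only the Python standard library. It is not the gaHealth toolchain itself: the function names `normalize` and `clean_pairs`, the `max_ratio` threshold, and the length-ratio heuristic are assumptions made for this example. Language detection and sentence alignment are left out because they depend on external tools that the summary does not name.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter candidate (English, Irish) sentence pairs.

    Drops pairs with an empty side, pairs whose character-length ratio
    exceeds max_ratio (a crude misalignment signal, assumed here), and
    exact duplicates. Returns the surviving pairs in their original order.
    """
    seen = set()
    kept = []
    for en, ga in pairs:
        en, ga = normalize(en), normalize(ga)
        if not en or not ga:
            continue  # one side is empty after normalization
        ratio = max(len(en), len(ga)) / min(len(en), len(ga))
        if ratio > max_ratio:
            continue  # lengths differ too much to be a plausible translation
        if (en, ga) in seen:
            continue  # exact duplicate of a pair already kept
        seen.add((en, ga))
        kept.append((en, ga))
    return kept
```

Calling `clean_pairs` on a list of `(english, irish)` tuples returns the filtered list. The threshold of 3.0 is arbitrary and would need tuning against real data before being used on a corpus.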
Key insights extracted from source content by Séam... at arxiv.org, 03-07-2024:
https://arxiv.org/pdf/2403.03575.pdf