The gaHealth corpus was created to improve Machine Translation (MT) for low-resource languages, with a focus on health data for the English-Irish language pair. Models trained on the corpus, which was built from professionally translated documents and Covid-related data, achieved a 40% higher BLEU score than other models. Development involved extracting, cleaning, and aligning bilingual text files from several sources. The toolchain safeguarded corpus quality by normalizing characters, detecting languages, and aligning sentences. Guidelines were also drawn up to streamline the conversion of PDF documents into a sentence-aligned corpus. Transformer models with optimized hyperparameters were trained for English-Irish and Irish-English translation in the health domain. Translation quality was evaluated with the automated metrics BLEU, TER, and ChrF, and the gaHealth models significantly outperformed the other systems.
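The cleaning stage described above (character normalization followed by filtering of sentence pairs) can be illustrated with a minimal Python sketch. It is not the gaHealth toolchain itself: the function names `normalize` and `clean_pairs`, the `max_ratio` threshold, and the sample sentences are all assumptions made for illustration, and the length-ratio check is only a crude stand-in for proper sentence alignment. Language detection is left out because it would need a third-party library.

```python
import unicodedata


def normalize(text: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    return " ".join(unicodedata.normalize("NFC", text).split())


def clean_pairs(pairs, max_ratio=3.0):
    """Filter (English, Irish) sentence pairs.

    max_ratio is a hypothetical threshold on the character-length ratio,
    used here as a rough proxy for spotting misaligned pairs.
    """
    seen = set()
    cleaned = []
    for en, ga in pairs:
        en, ga = normalize(en), normalize(ga)
        if not en or not ga:
            continue  # drop pairs with an empty side
        ratio = max(len(en), len(ga)) / min(len(en), len(ga))
        if ratio > max_ratio:
            continue  # drop pairs whose lengths differ too much
        if (en, ga) in seen:
            continue  # drop exact duplicates after normalization
        seen.add((en, ga))
        cleaned.append((en, ga))
    return cleaned


# Made-up sample pairs, not taken from the corpus.
sample = [
    ("Wash  your hands.", "Nigh do l\u00e1mha."),      # extra whitespace
    ("Wash your hands.", "Nigh do la\u0301mha."),      # decomposed accent
    ("", "Fan sa bhaile."),                            # empty source side
    ("Stay home.", "Fan sa bhaile agus n\u00e1 t\u00e9igh amach ar chor ar bith."),
]
print(clean_pairs(sample))
```

In this sketch the first two pairs collapse into one after normalization, because the precomposed and decomposed forms of "á" become identical under NFC, while the remaining two pairs are dropped by the empty-side and length-ratio checks.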
Source: arxiv.org