The gaHealth corpus was created to improve Machine Translation models for low-resource languages, with a focus on health data. Built from professionally translated documents and Covid-related data, the corpus enabled models that achieved a 40% increase in BLEU score compared to other models. Development involved extracting, cleaning, and aligning bilingual text files from various sources. The toolchain supported corpus quality by normalizing characters, detecting languages, and aligning sentences. Guidelines were established to streamline the conversion of PDF documents into a sentence-aligned corpus. Transformer models with optimized hyperparameters were trained for English-Irish and Irish-English translation in the health domain. Translation quality was evaluated with the automated metrics BLEU, TER, and ChrF, and the gaHealth models significantly outperformed the other systems.
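The cleaning stage described above can be illustrated with a minimal sketch. It is not the gaHealth toolchain itself: the function names, the sample sentences, and the `max_ratio` threshold are hypothetical, and language detection and sentence alignment are left out because they would need third-party tools. The sketch shows only character normalization, removal of empty and duplicate pairs, and a simple length-ratio check for pairs that are probably misaligned.

```python
import re
import unicodedata


def normalize(text: str) -> str:
    """NFC-normalize characters and collapse runs of whitespace."""
    text = unicodedata.normalize("NFC", text)
    return re.sub(r"\s+", " ", text).strip()


def clean_pairs(pairs, max_ratio=3.0):
    """Return cleaned, de-duplicated (English, Irish) sentence pairs.

    max_ratio is an illustrative threshold, not a value from the paper.
    """
    seen = set()
    cleaned = []
    for en, ga in pairs:
        en, ga = normalize(en), normalize(ga)
        if not en or not ga:
            continue  # drop pairs with an empty side
        ratio = max(len(en), len(ga)) / min(len(en), len(ga))
        if ratio > max_ratio:
            continue  # very different lengths suggest a misalignment
        if (en, ga) in seen:
            continue  # drop exact duplicates after normalization
        seen.add((en, ga))
        cleaned.append((en, ga))
    return cleaned


# Illustrative input only; these sentences are not taken from the corpus.
raw = [
    ("Wash your   hands.", "Nigh do  lámha."),
    ("Wash your hands.", "Nigh do lámha."),
    ("Stay at home.", ""),
    ("Hi", "Tá an t-ábhar seo i bhfad rófhada"),
]
print(clean_pairs(raw))
```

Of the four sample pairs, only the first survives: the second is a duplicate once whitespace is collapsed, the third has an empty Irish side, and the fourth fails the length-ratio check.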
Key insights distilled from the paper by Séam... on arxiv.org, 03-07-2024
https://arxiv.org/pdf/2403.03575.pdf