Core Concepts
This paper introduces the MedEV dataset, a high-quality Vietnamese-English parallel corpus containing 358,796 sentence pairs in the medical domain, and conducts a comprehensive empirical investigation into improving the performance of neural machine translation models in that domain.
Abstract
The paper presents MedEV, a high-quality Vietnamese-English parallel corpus for the medical domain. The authors built the dataset by collecting parallel document pairs from various sources, including scientific article abstracts, MSD Manuals, thesis summaries, and article translations. They then performed sentence alignment, data cleaning, and quality verification.
The paper then conducts an extensive evaluation of various machine translation models on the MedEV dataset, including Google Translate, ChatGPT, state-of-the-art Vietnamese-English NMT models, and pre-trained bilingual/multilingual sequence-to-sequence models. The results show that fine-tuning the vinai-translate model on the MedEV dataset achieves the best performance, outperforming Google Translate by a substantial margin in both English-to-Vietnamese and Vietnamese-to-English translation.
The authors also break down translation performance by sentence-length bucket and by resource genre. The MSD Manuals genre yields the highest BLEU scores, followed by thesis summaries and article translations. The article abstracts genre, which contains more medical terminology, shows lower BLEU scores.
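The sentence-length analysis can be illustrated with a minimal sketch of how test pairs might be grouped before scoring each group separately. The bucket boundaries, the whitespace tokenization, and the function names `bucket_label` and `group_by_length` are assumptions made for illustration; the summary does not state which buckets the authors used.

```python
from collections import defaultdict

# Hypothetical bucket boundaries in whitespace tokens (not taken from the paper).
BUCKET_BOUNDS = [10, 20, 30, 40, 50]


def bucket_label(n_tokens, bounds=BUCKET_BOUNDS):
    """Return a label such as '10-19' for the bucket containing n_tokens."""
    lower = 0
    for upper in bounds:
        if n_tokens < upper:
            return f"{lower}-{upper - 1}"
        lower = upper
    # Anything at or beyond the last boundary goes into an open-ended bucket.
    return f"{lower}+"


def group_by_length(pairs):
    """Group (source, reference) pairs by the source sentence's token count."""
    buckets = defaultdict(list)
    for src, ref in pairs:
        buckets[bucket_label(len(src.split()))].append((src, ref))
    return dict(buckets)
```

Each resulting group could then be scored on its own with a BLEU implementation of choice, which gives one score per length bucket. The same grouping idea applies to genres if each pair carries a genre tag.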
The paper concludes by publicly releasing the MedEV dataset to promote further research in Vietnamese-English medical machine translation.
Stats
The MedEV dataset contains 358,796 parallel sentence pairs, with an average of 25.09 word tokens per English sentence and 33.76 syllable tokens per Vietnamese sentence.
The dataset is split into 340,897 training pairs, 8,939 validation pairs, and 8,960 test pairs.
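As a quick consistency check, the three split sizes sum to the reported total of 358,796 pairs, which works out to roughly 95% training, 2.5% validation, and 2.5% test. The short sketch below reproduces that arithmetic; the variable names are illustrative only.

```python
# Split sizes as reported in the Stats section.
splits = {"train": 340_897, "validation": 8_939, "test": 8_960}

# Total number of sentence pairs across all splits.
total = sum(splits.values())

# Fraction of the corpus in each split.
shares = {name: count / total for name, count in splits.items()}

print(f"total pairs: {total}")
for name, share in shares.items():
    print(f"{name}: {share:.2%}")
```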
Quotes
"To the best of our knowledge, this marks the first empirical study focusing on Vietnamese-English medical machine translation."
"We make the MedEV dataset publicly available for research and educational purposes."