The paper presents EthioMT, a parallel corpus covering 15 Ethiopian languages from the Afro-Asiatic and Nilo-Saharan language families, and details each language's family, number of speakers, and dataset size.
The authors collected data for each language, primarily from the religious domain, and aligned the sentences with their English translations. They then preprocessed the data and split it into training, development, and test sets.
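The preprocessing step above ends with a train/development/test split of the aligned sentence pairs. A minimal sketch of such a split is shown below; the function name and the 90/5/5 ratios are illustrative assumptions, since the summary does not state the exact proportions used.

```python
import random

def train_dev_test_split(pairs, dev_frac=0.05, test_frac=0.05, seed=42):
    """Shuffle aligned sentence pairs and split into train/dev/test.

    `pairs` is a list of (source_sentence, english_sentence) tuples.
    The fractions are illustrative; the paper's actual ratios may differ.
    """
    rng = random.Random(seed)          # fixed seed for a reproducible split
    shuffled = pairs[:]                # copy so the caller's list is untouched
    rng.shuffle(shuffled)
    n_dev = int(len(shuffled) * dev_frac)
    n_test = int(len(shuffled) * test_frac)
    dev = shuffled[:n_dev]
    test = shuffled[n_dev:n_dev + n_test]
    train = shuffled[n_dev + n_test:]
    return train, dev, test
```

Shuffling before splitting matters here because religious-domain corpora are often ordered by book and chapter, so a contiguous split would give topically skewed test sets.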
To evaluate the usefulness of the new corpus, the authors ran baseline experiments with two approaches: a Transformer model trained from scratch and a fine-tuned multilingual M2M100-48 model. The fine-tuned model outperformed the from-scratch Transformer in both translation directions (English to the Ethiopian languages and vice versa), and performance was higher for languages with larger datasets (Amharic, Afaan Oromo, Somali, and Tigrinya) than for those with smaller ones.
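Comparisons like the one above are typically scored with BLEU, the standard metric for machine translation (the summary does not name the metric, so this is an assumption). The sketch below is a minimal corpus-level BLEU with uniform n-gram weights and a brevity penalty, not the exact evaluation pipeline the authors used.

```python
import math
from collections import Counter

def _ngrams(tokens, n):
    # Multiset of all n-grams in a token list.
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def corpus_bleu(hypotheses, references, max_n=4):
    """Minimal corpus-level BLEU: clipped n-gram precisions (n = 1..max_n),
    uniform weights, and the standard brevity penalty.

    `hypotheses` and `references` are parallel lists of whitespace-tokenized
    sentences. This is an illustrative sketch, not the paper's scorer.
    """
    clipped = [0] * max_n
    totals = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        h, r = hyp.split(), ref.split()
        hyp_len += len(h)
        ref_len += len(r)
        for n in range(1, max_n + 1):
            h_ng, r_ng = _ngrams(h, n), _ngrams(r, n)
            totals[n - 1] += sum(h_ng.values())
            # Clip each hypothesis n-gram count by its count in the reference.
            clipped[n - 1] += sum(min(c, r_ng[g]) for g, c in h_ng.items())
    if min(clipped) == 0:
        return 0.0  # some precision is zero; the geometric mean collapses
    log_prec = sum(math.log(c / t) for c, t in zip(clipped, totals)) / max_n
    bp = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return bp * math.exp(log_prec)
```

In practice one would use an established implementation such as sacreBLEU rather than a hand-rolled scorer, since tokenization choices change BLEU scores substantially.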
The authors conclude that the EthioMT corpus can foster collaboration and facilitate research and development in low-resource Ethiopian languages. They plan to expand the corpus size and explore additional machine translation approaches in the future.
Source: arxiv.org