Core Concepts
This paper introduces four multilingual pre-trained language models (PLMs) tailored to five Angolan languages, built with a Multilingual Adaptive Fine-tuning (MAFT) approach. The authors show that informed embedding initialization via the OFA method, combined with synthetic data, significantly improves the MAFT models' performance on downstream tasks.
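The full OFA method factorizes embeddings and draws on external multilingual word vectors; the snippet below is only a simplified sketch of the general idea behind informed initialization, namely reusing source-model embeddings for overlapping tokens and deriving the rest from source subword pieces instead of initializing them randomly. The target tokenizer path is hypothetical.

```python
# Simplified sketch of informed embedding initialization for an adapted vocabulary.
# NOTE: this is NOT the full OFA method (which factorizes embeddings and uses
# external multilingual word vectors); it only illustrates deriving new-token
# embeddings from the source model instead of random initialization.
import torch
from transformers import AutoModelForMaskedLM, AutoTokenizer

source_name = "xlm-roberta-base"
src_tok = AutoTokenizer.from_pretrained(source_name)
model = AutoModelForMaskedLM.from_pretrained(source_name)

# Hypothetical tokenizer trained on the Angolan-language corpus (path is illustrative).
tgt_tok = AutoTokenizer.from_pretrained("path/to/angolan-tokenizer")

src_emb = model.get_input_embeddings().weight.detach()
new_emb = torch.empty(len(tgt_tok), src_emb.size(1))

src_vocab = src_tok.get_vocab()
for token, new_id in tgt_tok.get_vocab().items():
    if token in src_vocab:
        # Overlapping token: reuse the source embedding directly.
        new_emb[new_id] = src_emb[src_vocab[token]]
    else:
        # Unseen token: average the embeddings of its pieces under the source tokenizer.
        piece_ids = src_tok.convert_tokens_to_ids(src_tok.tokenize(token.replace("▁", " ")))
        piece_ids = [i for i in piece_ids if i != src_tok.unk_token_id]
        new_emb[new_id] = src_emb[piece_ids].mean(0) if piece_ids else src_emb.mean(0)

# Swap the new embedding matrix into the model before MAFT continued pre-training
# (XLM-R ties input and output embeddings, so resizing covers the LM head as well).
model.resize_token_embeddings(len(tgt_tok))
model.get_input_embeddings().weight.data.copy_(new_emb)
```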
Abstract
The paper addresses the under-representation of Angolan languages in the development of multilingual language models. It introduces four PLMs tailored to five Angolan languages (Chokwe, Kimbundu, Kikongo, Luba-Kasai, and Umbundu) using the MAFT approach.
The authors compare MAFT models without and with informed embedding initialization, denoted ANGXLM-R and ANGOFA, respectively. They find that ANGOFA, which leverages the OFA approach for embedding initialization and incorporates synthetic data, significantly outperforms ANGXLM-R and other baselines.
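For readers unfamiliar with MAFT, the following is a minimal sketch, assuming a standard Hugging Face masked-language-modelling setup: an existing multilingual checkpoint (here xlm-roberta-base) is simply further pre-trained on the target-language corpus. File paths and hyperparameters are illustrative, not the paper's settings.

```python
# Minimal sketch of Multilingual Adaptive Fine-tuning (MAFT): continued masked
# language modelling of an existing multilingual PLM on the target-language corpus.
# Paths and hyperparameters below are illustrative, not the paper's settings.
from datasets import load_dataset
from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "xlm-roberta-base"            # starting checkpoint for the adaptation
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

# Hypothetical plain-text files holding the monolingual (+ synthetic) Angolan data.
raw = load_dataset("text", data_files={"train": "angolan_corpus/*.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=512)

tokenized = raw.map(tokenize, batched=True, remove_columns=["text"])

# Dynamic masking of 15% of tokens, the usual MLM objective.
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="angxlmr-maft",
    per_device_train_batch_size=16,
    num_train_epochs=3,
    learning_rate=5e-5,
    save_steps=10_000,
)

Trainer(model=model, args=args, train_dataset=tokenized["train"],
        data_collator=collator).train()
```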
Key highlights:
Region-specific PLMs covering related languages within the same family can be more effective than pre-training on many languages from scratch.
Incorporating synthetic data can boost the performance of MAFT models.
OFA embedding initialization is superior to random initialization, and its advantage is further amplified by access to larger training data through a synthetic corpus.
ANGOFA, the MAFT model with OFA initialization and synthetic data, achieves the best overall performance, outperforming XLM-R by 16.6 points, AfroXLMR by 12.3 points, and ANGXLM-R (with synthetic data) by 5.6 points on the SIB-200 text classification benchmark.
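As a hedged sketch of the downstream evaluation step, the snippet below fine-tunes an adapted checkpoint on SIB-200 topic classification for one language. The Hub dataset name (Davlan/sib200), the umb_Latn config, the text/category column names, and the split names are assumptions based on the public SIB-200 release; the paper's exact training recipe and reported metric may differ.

```python
# Hedged sketch: fine-tune a MAFT/ANGOFA-style checkpoint on SIB-200 topic
# classification for Umbundu. Dataset/config/column names are assumptions.
import numpy as np
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "angxlmr-maft"                 # adapted checkpoint from the MAFT step
tokenizer = AutoTokenizer.from_pretrained(checkpoint)

# "umb_Latn" (Umbundu) is assumed to be one of the SIB-200 language configs.
sib = load_dataset("Davlan/sib200", "umb_Latn")

labels = sorted(set(sib["train"]["category"]))
label2id = {l: i for i, l in enumerate(labels)}

def preprocess(batch):
    enc = tokenizer(batch["text"], truncation=True, max_length=256)
    enc["labels"] = [label2id[c] for c in batch["category"]]
    return enc

encoded = sib.map(preprocess, batched=True,
                  remove_columns=sib["train"].column_names)

model = AutoModelForSequenceClassification.from_pretrained(
    checkpoint, num_labels=len(labels))

def accuracy(eval_pred):
    logits, gold = eval_pred
    return {"accuracy": (np.argmax(logits, -1) == gold).mean()}

args = TrainingArguments(output_dir="sib200-umb", num_train_epochs=5,
                         per_device_train_batch_size=16, learning_rate=2e-5)

trainer = Trainer(model=model, args=args, tokenizer=tokenizer,
                  train_dataset=encoded["train"],
                  eval_dataset=encoded["validation"],
                  compute_metrics=accuracy)
trainer.train()
print(trainer.evaluate(encoded["test"]))
```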
Stats
The NLLB dataset (excluding English translations) was used as the monolingual pre-training corpus, totaling 281.6 MB.
Synthetic data generated with the NLLB-600M machine translation model was added, bringing the combined corpus to 808.6 MB; a sketch of this generation step appears below.
The SIB-200 text classification dataset, covering 7 topic classes across more than 200 languages and dialects, was used for evaluation.
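As a sketch of how the synthetic portion could be produced, the snippet below translates high-resource text into a target language with the 600M-parameter NLLB model, assumed here to be facebook/nllb-200-distilled-600M; the source sentence and language codes are illustrative, not the paper's pipeline.

```python
# Hedged sketch of synthetic data generation with the 600M NLLB model
# (assumed checkpoint: facebook/nllb-200-distilled-600M).
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

mt_name = "facebook/nllb-200-distilled-600M"
tok = AutoTokenizer.from_pretrained(mt_name, src_lang="por_Latn")  # e.g. Portuguese source
mt = AutoModelForSeq2SeqLM.from_pretrained(mt_name)

def translate(sentences, tgt_lang="umb_Latn"):
    """Translate a batch of sentences into the target language (here Umbundu)."""
    batch = tok(sentences, return_tensors="pt", padding=True, truncation=True)
    out = mt.generate(
        **batch,
        forced_bos_token_id=tok.convert_tokens_to_ids(tgt_lang),
        max_length=256,
    )
    return tok.batch_decode(out, skip_special_tokens=True)

# Each translated sentence becomes one line of synthetic monolingual text for MAFT.
print(translate(["A educação é a chave do desenvolvimento."]))
```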
Quotes
"Region-specific PLMs covering related languages within the same family can be more effective than pre-training on many languages from scratch."
"Incorporating synthetic data can boost the performance of MAFT models."
"OFA embedding initialization is superior to random initialization, and its advantage is further amplified by access to larger training data through synthetic corpus."