Key Concepts
The paper describes a two-phase approach to building a high-performing English-Ukrainian machine translation system: supervised finetuning on a large, noisy parallel dataset, followed by further finetuning on a high-quality dataset with unsupervised data selection.
Summary
The authors present a recipe for building an English-Ukrainian machine translation system using a two-phase approach.
In the first phase, they perform supervised finetuning of a large pretrained language model (Mistral-7B-v0.1) on a noisy parallel dataset of 3 million English-Ukrainian sentence pairs. The data comes from the publicly available Paracrawl dataset, to which they apply heuristic filters to control training-data quality: language detection, perplexity thresholding, translation mismatch filtering, and length filtering.
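The filtering step can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the Cyrillic-share check is a crude stand-in for a real language detector, the thresholds (`max_ratio`, `max_chars`, `max_score`) are hypothetical, and the `score` callback stands for any per-pair perplexity function. Translation mismatch filtering is omitted because the summary does not describe how it works.

```python
from typing import Callable, Iterable, List, Optional, Tuple

Pair = Tuple[str, str]  # (English sentence, Ukrainian sentence)


def cyrillic_share(text: str) -> float:
    """Fraction of alphabetic characters in the basic Cyrillic block."""
    letters = [c for c in text if c.isalpha()]
    if not letters:
        return 0.0
    return sum("\u0400" <= c <= "\u04ff" for c in letters) / len(letters)


def language_ok(pair: Pair) -> bool:
    """Crude language check (assumption): mostly Latin source, mostly Cyrillic target."""
    en, uk = pair
    return cyrillic_share(en) < 0.5 and cyrillic_share(uk) >= 0.5


def length_ok(pair: Pair, max_ratio: float = 2.0, max_chars: int = 500) -> bool:
    """Reject empty, overly long, or badly length-mismatched pairs."""
    en, uk = pair
    if not en.strip() or not uk.strip():
        return False
    if len(en) > max_chars or len(uk) > max_chars:
        return False
    return max(len(en), len(uk)) / min(len(en), len(uk)) <= max_ratio


def filter_pairs(
    pairs: Iterable[Pair],
    score: Optional[Callable[[Pair], float]] = None,
    max_score: float = float("inf"),
) -> List[Pair]:
    """Keep pairs that pass every heuristic; `score` is an optional perplexity-like function."""
    kept = []
    for pair in pairs:
        if not language_ok(pair) or not length_ok(pair):
            continue
        if score is not None and score(pair) > max_score:
            continue
        kept.append(pair)
    return kept
```

In practice the `score` callback would wrap a language model's perplexity on the pair, and the threshold would be tuned to reach a target subset size.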
In the second phase, the authors further finetune the model on the high-quality Extended Multi30K dataset. They employ unsupervised perplexity-based data selection, using k-fold cross-validation to identify and remove the most surprising (highest-perplexity) sentences from the training set. This second phase provides an additional performance boost.
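The k-fold selection idea can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' implementation: a smoothed unigram model stands in for the finetuned language model, folds are assigned by index modulo `k`, and `drop_fraction` is a hypothetical parameter. Each sentence is scored only by a model that never saw it during training, and the highest-perplexity sentences are then dropped.

```python
import math
from collections import Counter
from typing import Callable, List, Sequence

Scorer = Callable[[str], float]


def train_unigram(sentences: Sequence[str]) -> Scorer:
    """Toy stand-in for a language model: add-one smoothed unigram perplexity."""
    counts = Counter(tok for s in sentences for tok in s.lower().split())
    total = sum(counts.values())
    vocab = len(counts) + 1  # one extra slot for unseen tokens

    def perplexity(sentence: str) -> float:
        toks = sentence.lower().split()
        if not toks:
            return float("inf")
        logp = sum(math.log((counts[t] + 1) / (total + vocab)) for t in toks)
        return math.exp(-logp / len(toks))

    return perplexity


def kfold_select(
    sentences: Sequence[str],
    k: int = 5,
    drop_fraction: float = 0.1,
    train: Callable[[Sequence[str]], Scorer] = train_unigram,
) -> List[str]:
    """Score each fold with a model trained on the other folds, then drop the most surprising."""
    n = len(sentences)
    scores = [0.0] * n
    for fold in range(k):
        train_set = [sentences[i] for i in range(n) if i % k != fold]
        scorer = train(train_set)
        for i in range(fold, n, k):  # indices held out in this fold
            scores[i] = scorer(sentences[i])
    n_drop = int(n * drop_fraction)
    ranked = sorted(range(n), key=lambda i: scores[i], reverse=True)
    dropped = set(ranked[:n_drop])
    return [s for i, s in enumerate(sentences) if i not in dropped]
```

Scoring held-out folds avoids rewarding sentences the model has already memorized, which is the reason for using cross-validation rather than a single model trained on the full set.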
The authors' final model, named Dragoman, outperforms previous state-of-the-art encoder-decoder models on the FLORES-101 English-Ukrainian devtest set, achieving 32.3 BLEU. They also explore few-shot translation using pretrained models, but find that their finetuned Dragoman model still outperforms these approaches.
The authors discuss the limitations of their work, including the challenges of tokenization for Ukrainian, the choice of evaluation metric (BLEU), and the need for more nuanced translation quality assessment. They also highlight potential future directions, such as exploring the stability of long-context attention and incorporating community-informed metrics for Ukrainian.
Statistics
The Paracrawl dataset contains 13,354,365 English-Ukrainian sentence pairs, with significant noise and quality issues.
After applying heuristic filters, the authors obtain subsets of 1 million, 3 million, and 8 million sentence pairs.
The Extended Multi30K dataset contains 29,000 high-quality English-Ukrainian sentence pairs.
Quotes
"To build large language models for Ukrainian we need to expand our corpora with large amounts of new algorithmic tasks expressed in natural language."
"Our decoder-only model named Dragoman beats performance of previous state of the art encoder-decoder models on the FLORES devtest set."