Core Concepts
This paper explores cost-efficient methods to adapt the Llama 2 language model to the Estonian language, leveraging cross-lingual instruction-tuning and additional monolingual pretraining.
Abstract
The paper investigates methods to adapt the Llama 2 language model to Estonian, a lower-resource language. The key insights are:
Continued pretraining of Llama 2 on Estonian and English data leads to performance gains on Estonian tasks compared to the base Llama 2 model.
Combining cross-lingual instruction-tuning with additional monolingual pretraining significantly enhances results on Estonian tasks. Even a relatively small amount of monolingual pretraining can improve performance.
Supplementing the instruction-tuning dataset with high-quality English instructions and conversations leads to positive cross-lingual knowledge transfer, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities in Estonian.
The best model, named LLAMMAS, represents the first open-source instruction-following language model for Estonian. The authors also release Alpaca-est, the first general task instruction dataset for Estonian.
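Alpaca-style instruction datasets, the family Alpaca-est belongs to, typically store each example as an instruction/input/output triple that is rendered into a single training prompt. As a hedged illustration (the exact Alpaca-est schema and prompt template are not specified here; the field names and "###" markers follow the common Alpaca convention), such a record might be rendered like this:

```python
# Hypothetical sketch: rendering an Alpaca-style record into a training prompt.
# The actual Alpaca-est schema and prompt template may differ.

def render_prompt(record: dict) -> str:
    """Turn an instruction/input/output record into one training string."""
    if record.get("input"):
        return (
            f"### Instruction:\n{record['instruction']}\n\n"
            f"### Input:\n{record['input']}\n\n"
            f"### Response:\n{record['output']}"
        )
    # Records without an input field omit the Input section entirely.
    return (
        f"### Instruction:\n{record['instruction']}\n\n"
        f"### Response:\n{record['output']}"
    )

example = {
    "instruction": "Translate the sentence into Estonian.",
    "input": "Good morning!",
    "output": "Tere hommikust!",
}
print(render_prompt(example))
```

The rendered string is then used as a supervised fine-tuning target, with the loss usually computed only on the response portion.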
Experiments show that the addition of translation task instructions during fine-tuning can be beneficial, especially when no monolingual pretraining is performed. However, the benefits diminish when monolingual pretraining is included.
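Supplementing an instruction-tuning set with translation tasks amounts to converting parallel sentence pairs into instruction/input/output records and mixing them in alongside the general instructions. A minimal sketch, assuming illustrative function names and a freely chosen mixing share (neither is taken from the paper):

```python
import random

# Hypothetical sketch of adding translation-task instructions to an
# instruction-tuning dataset. Function names, the instruction wording,
# and the mixing share are illustrative assumptions, not the paper's setup.

def make_translation_instruction(src: str, tgt: str) -> dict:
    """Wrap one parallel sentence pair as an instruction-tuning record."""
    return {
        "instruction": "Translate the following English sentence into Estonian.",
        "input": src,
        "output": tgt,
    }

def mix_datasets(general, translation, translation_share, seed=0):
    """Combine general instructions with a capped share of translation records."""
    rng = random.Random(seed)
    n_trans = min(int(len(general) * translation_share), len(translation))
    mixed = general + rng.sample(translation, n_trans)
    rng.shuffle(mixed)
    return mixed

pairs = [("Good morning!", "Tere hommikust!"), ("Thank you.", "Aitäh.")]
translation_records = [make_translation_instruction(s, t) for s, t in pairs]
general_records = [{"instruction": "Summarise the text.", "input": "...", "output": "..."}] * 10
mixed = mix_datasets(general_records, translation_records, translation_share=0.1)
```

With this kind of mixing, the share of translation data is an explicit knob, which matches the observation that its benefit depends on whether monolingual pretraining was also performed.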
The authors evaluate their models on Estonian question answering, commonsense reasoning, machine translation, and grammatical error correction tasks, demonstrating competitive performance.
Stats
Even a relatively small amount of additional monolingual pretraining (1B tokens) leads to performance gains on Estonian tasks compared to the base Llama 2 model.
Pretraining Llama 2 on 5B tokens of Estonian and English data further improves results on Estonian tasks.
Quotes
"This paper explores cost-efficient methods to adapt pretrained Large Language Models (LLMs) to new lower-resource languages, with a specific focus on Estonian."
"Our results demonstrate that even a relatively small amount of additional monolingual pretraining followed by cross-lingual instruction-tuning significantly enhances results on Estonian."
"Furthermore, we showcase cross-lingual knowledge transfer from high-quality English instructions to Estonian, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities."