The paper investigates methods for adapting the Llama 2 language model to Estonian, a low-resource language setting. The key insights are:
Continued pretraining of Llama 2 on a mix of Estonian and English data leads to performance gains on Estonian tasks compared to the base Llama 2 model (a data-mixing sketch follows the list of insights below).
Combining cross-lingual instruction-tuning with additional monolingual pretraining significantly enhances results on Estonian tasks. Even a relatively small amount of monolingual pretraining can improve performance.
Supplementing the instruction-tuning dataset with high-quality English instructions and conversations leads to positive cross-lingual knowledge transfer, resulting in improvements in commonsense reasoning and multi-turn conversation capabilities in Estonian.
The best model, named LLAMMAS, represents the first open-source instruction-following language model for Estonian. The authors also release Alpaca-est, the first general task instruction dataset for Estonian.
Experiments show that adding translation task instructions during fine-tuning can be beneficial, especially when no monolingual pretraining is performed; the benefit diminishes once monolingual pretraining is included (see the translation-instruction sketch after this list).
The authors evaluate their models on Estonian question answering, commonsense reasoning, machine translation, and grammatical error correction tasks, demonstrating competitive performance.
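To make the data-mixing idea behind the continued-pretraining result concrete, here is a minimal sketch of interleaving an Estonian and an English corpus at a fixed sampling ratio with the Hugging Face `datasets` library. The file names and the 75/25 ratio are illustrative assumptions, not values taken from the paper.

```python
from datasets import load_dataset, interleave_datasets

# Hypothetical local text files standing in for the Estonian and English
# pretraining corpora; the corpora actually used in the paper may differ.
et_corpus = load_dataset("text", data_files="estonian_corpus.txt", split="train")
en_corpus = load_dataset("text", data_files="english_corpus.txt", split="train")

# Interleave the two languages at a fixed sampling ratio (75% Estonian,
# 25% English here, chosen for illustration) so the model adapts to Estonian
# while retaining English ability.
mixed = interleave_datasets(
    [et_corpus, en_corpus],
    probabilities=[0.75, 0.25],
    seed=42,
)

# `mixed` can then be tokenized and passed to a standard causal-LM trainer
# to continue pretraining Llama 2.
print(mixed[0])
```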
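Similarly, the translation-instruction finding can be illustrated with a hedged sketch that wraps a parallel sentence pair as an instruction-tuning record. The prompt template and field names follow the public Alpaca format and are assumptions rather than the paper's exact templates.

```python
# Minimal sketch: cast an English-Estonian sentence pair as an Alpaca-style
# instruction example that can be mixed into the fine-tuning data.
ALPACA_TEMPLATE = (
    "Below is an instruction that describes a task, paired with an input.\n"
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n"
    "### Input:\n{input}\n\n"
    "### Response:\n{output}"
)

def translation_example(src_sentence: str, tgt_sentence: str) -> dict:
    """Turn one English-Estonian sentence pair into an instruction record."""
    return {
        "instruction": "Translate the following sentence from English to Estonian.",
        "input": src_sentence,
        "output": tgt_sentence,
    }

pair = translation_example("The weather is nice today.", "Ilm on täna ilus.")
print(ALPACA_TEMPLATE.format(**pair))
```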