The paper investigates methods for adapting the Llama 2 language model to Estonian, a low-resource language setting. The key insights are:
Continued pretraining of Llama 2 on Estonian and English data leads to performance gains on Estonian tasks compared to the base Llama 2 model (a pretraining sketch appears after this list).
Combining cross-lingual instruction-tuning with additional monolingual pretraining significantly enhances results on Estonian tasks. Even a relatively small amount of monolingual pretraining can improve performance.
Supplementing the instruction-tuning dataset with high-quality English instructions and conversations produces positive cross-lingual knowledge transfer, improving commonsense reasoning and multi-turn conversation capabilities in Estonian (a data-mixing sketch appears after this list).
The best model, named LLAMMAS, is the first open-source instruction-following language model for Estonian. The authors also release Alpaca-est, the first general-task instruction dataset for Estonian.
Experiments show that adding translation-task instructions during fine-tuning can be beneficial, especially when no monolingual pretraining is performed; the benefit diminishes once monolingual pretraining is included.
The authors evaluate their models on Estonian question answering, commonsense reasoning, machine translation, and grammatical error correction tasks, demonstrating competitive performance (a simple scoring sketch appears below).
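The continued-pretraining step can be pictured with a minimal sketch using Hugging Face Transformers. The corpus file names, the 75/25 Estonian/English mixing ratio, and the hyperparameters below are illustrative assumptions, not the paper's exact recipe.

```python
# Minimal continued-pretraining sketch (assumed file names, mixing ratio,
# and hyperparameters; not the paper's exact configuration).
from datasets import load_dataset, interleave_datasets
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

model_name = "meta-llama/Llama-2-7b-hf"
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.eos_token  # Llama 2 has no pad token by default
model = AutoModelForCausalLM.from_pretrained(model_name)

# Hypothetical plain-text corpora, one document per line.
et = load_dataset("text", data_files="estonian_corpus.txt", split="train")
en = load_dataset("text", data_files="english_corpus.txt", split="train")
mixed = interleave_datasets([et, en], probabilities=[0.75, 0.25], seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = mixed.map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="llama2-et-continued",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        learning_rate=2e-5,
        num_train_epochs=1,
        bf16=True,
    ),
    train_dataset=tokenized,
    # mlm=False -> causal language modeling; labels are the input ids.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```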
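The instruction-tuning mixture described above (Alpaca-est plus English instructions and conversations, optionally extended with translation-task examples) can be sketched as simple data assembly. The file names, record fields, and prompt wording are assumptions made for illustration; the released Alpaca-est format and the exact Llammas prompt may differ.

```python
import json
import random

# Alpaca-style prompt template (illustrative wording, not necessarily the
# template used for Llammas or Alpaca-est).
PROMPT = (
    "Below is an instruction that describes a task. "
    "Write a response that appropriately completes the request.\n\n"
    "### Instruction:\n{instruction}\n\n### Response:\n{output}"
)

def load_instructions(path):
    """Load a JSON list of {'instruction': ..., 'output': ...} records (assumed format)."""
    with open(path, encoding="utf-8") as f:
        return json.load(f)

def translation_instructions(pairs, src_lang="English", tgt_lang="Estonian"):
    """Turn parallel sentence pairs into translation-task instructions."""
    return [
        {
            "instruction": f"Translate the following {src_lang} sentence into {tgt_lang}.\n{src}",
            "output": tgt,
        }
        for src, tgt in pairs
    ]

# Hypothetical inputs: Estonian instructions, English instructions, parallel data.
estonian = load_instructions("alpaca_est.json")
english = load_instructions("english_instructions.json")
parallel = [("The weather is nice today.", "Ilm on täna ilus.")]

mixed = estonian + english + translation_instructions(parallel)
random.shuffle(mixed)

# Render each record into a single training string for supervised fine-tuning.
train_texts = [PROMPT.format(**ex) for ex in mixed]
print(train_texts[0][:200])
```

The resulting strings would then be used for supervised fine-tuning of the checkpoint produced by the continued-pretraining sketch above.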
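For the machine-translation evaluation, corpus-level metrics such as BLEU and chrF are standard choices; the toy sketch below scores placeholder output with sacrebleu (the sentences stand in for real system output and references).

```python
# Minimal MT scoring sketch with sacrebleu (toy hypothesis/reference strings).
import sacrebleu

hypotheses = ["Ilm on täna ilus."]    # model outputs (English -> Estonian)
references = [["Täna on ilus ilm."]]  # one list per reference set

bleu = sacrebleu.corpus_bleu(hypotheses, references)
chrf = sacrebleu.corpus_chrf(hypotheses, references)
print(f"BLEU: {bleu.score:.1f}  chrF: {chrf.score:.1f}")
```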
Key insights extracted from the paper by Hele-Andra K... at arxiv.org, 04-08-2024: https://arxiv.org/pdf/2404.04042.pdf