Core Concepts
Low-resource languages like Amharic can benefit from integrating task-specific and generative datasets to enhance language model performance.
Abstract
Large language models (LLMs) excel at understanding and generating human language.
Low-resource languages such as Amharic lack the resources needed to build and improve such models.
This work improves the LLAMA-2-Amharic model by integrating task-specific and generative instruction datasets.
Introduction:
LLMs such as the GPT series demonstrate exceptional linguistic comprehension and text-generation abilities.
LLAMA-2's pre-training covers only a limited set of languages, excluding low-resource ones such as Amharic.
Adapting LLMs to low-resource languages is challenging due to the lack of quality instruction datasets.
Related Work:
Open-source LLMs make it possible to build specialized language models for specific applications.
Techniques such as LoRA and QLoRA offer parameter-efficient methods for fine-tuning large language models (see the sketch below).
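To illustrate how QLoRA-style fine-tuning is typically set up, here is a minimal sketch using the Hugging Face transformers and peft libraries. The base model name, quantization settings, and LoRA hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
# Minimal QLoRA sketch: load a base model in 4-bit precision and attach
# small trainable low-rank adapters. All settings here are assumptions.
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

# 4-bit quantization of the frozen base model is the core idea of QLoRA.
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
model = AutoModelForCausalLM.from_pretrained(
    "meta-llama/Llama-2-7b-hf",  # placeholder; the paper works with an Amharic variant
    quantization_config=bnb_config,
)

# Attach low-rank adapters to the attention projections; only these train.
lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "v_proj"],
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()  # adapter weights are a tiny fraction of the model
```

Because only the adapter weights are updated while the quantized base model stays frozen, this style of fine-tuning fits billion-parameter models on modest hardware, which is what makes adapting LLAMA-2 to a low-resource language practical.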
Dataset Preparation:
Creation of instruction-based datasets from existing task-specific NLP datasets (see the sketch below).
Introduction of new custom datasets for generation tasks in Amharic.
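To make the conversion concrete, here is a minimal sketch of wrapping a labeled classification example into an Alpaca-style instruction/input/output record; the field names, file name, and sentiment example are illustrative assumptions rather than the paper's actual schema.

```python
# Minimal sketch: convert a task-specific (text, label) pair into an
# instruction-tuning record. The schema and example are assumptions.
import json

def to_instruction_record(text: str, label: str) -> dict:
    """Wrap a labeled classification example in an instruction/input/output triple."""
    return {
        "instruction": "Classify the sentiment of the following Amharic text.",
        "input": text,
        "output": label,
    }

records = [to_instruction_record("ጥሩ ፊልም ነው።", "positive")]  # "It is a good movie."
with open("amharic_instructions.jsonl", "w", encoding="utf-8") as f:
    for record in records:
        # ensure_ascii=False keeps the Amharic script readable in the file.
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```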
Experiments:
Evaluation of existing and fine-tuned models using different dataset combinations.
Exploration of how prompt design affects model performance on Amharic tasks (see the sketch below).
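As a concrete illustration of prompt exploration, here is a minimal sketch that renders the same input through different prompt templates (for example, an English versus an Amharic instruction) so the resulting model outputs can be compared; the templates are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch: compare prompt templates on one input. Both templates
# below are illustrative assumptions, not the paper's exact prompts.
PROMPTS = {
    "english_instruction": "Translate the following Amharic sentence to English:\n{text}",
    "amharic_instruction": "የሚከተለውን የአማርኛ ዓረፍተ ነገር ወደ እንግሊዝኛ ተርጉም:\n{text}",
}

def build_prompts(text: str) -> dict[str, str]:
    """Render every template against the same input for side-by-side comparison."""
    return {name: tpl.format(text=text) for name, tpl in PROMPTS.items()}

# Each rendered prompt would be sent to the model and the outputs scored.
for name, prompt in build_prompts("ሰላም ዓለም").items():  # "hello world"
    print(f"--- {name} ---\n{prompt}\n")
```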
Results:
Improvements in classification, generation, and machine-translation tasks when fine-tuning with the curated datasets.
Human evaluation shows enhanced generative capabilities with specific datasets.
Conclusion and Future Works:
Integration of human-annotated instruction datasets for further model evaluation.
Stats
"Amharic is one of the Semitic languages under the Afroasiatic language family spoken in Ethiopia with more than 57M speakers."
"The result shows a significant enhancement of the model’s ability to comprehend and execute instructions."
"We used datasets from LLAMA-2-Amharic, Alpaca, and dolly datasets."