Core Concepts
EthioLLM introduces multilingual language models for five Ethiopian languages and English, addressing the scarcity of pre-trained models and resources for these low-resource languages.
Abstract
Introduction:
Introduction to EthioLLM and its significance in NLP.
Challenges faced by low-resource languages, such as the languages of Ethiopia.
Large Language Models:
Overview of transformer-based models such as GPT, XLM-RoBERTa, mT5, and mBERT.
Importance of pre-trained language models for downstream NLP tasks (a loading sketch follows below).
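To make this concrete, here is a minimal sketch of how such a pre-trained multilingual model is typically loaded and queried with the Hugging Face transformers library. The xlm-roberta-base checkpoint is publicly available, but the example sentence and usage are illustrative and not taken from the paper.

```python
# Minimal sketch: querying a public pre-trained multilingual masked LM.
# Requires the Hugging Face transformers library (pip install transformers).
from transformers import pipeline

# xlm-roberta-base is a publicly released multilingual checkpoint.
fill_mask = pipeline("fill-mask", model="xlm-roberta-base")

# The model ranks candidate fillers for the <mask> token.
for prediction in fill_mask("Addis Ababa is the capital of <mask>."):
    print(prediction["token_str"], round(prediction["score"], 3))
```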
Afro-centric Models:
Development of models focused on African languages (e.g., AfriBERTa, AfroXLMR).
Their limited coverage of most Ethiopian languages.
EthioLLM:
Introduction of EthioLLM, covering five Ethiopian languages (Amharic, Ge'ez, Oromo, Somali, Tigrinya) plus English.
Creation of the Ethiobenchmark dataset for evaluating downstream NLP tasks (a hedged loading sketch follows below).
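Below is a hedged sketch of how a released EthioLLM checkpoint might be loaded for feature extraction, assuming an encoder-style (XLM-R-like) checkpoint published under the EthioNLP organization. The model identifier used here is an assumption, not a confirmed name; verify the exact identifier on the Hugging Face hub.

```python
# Hedged sketch: loading an EthioLLM checkpoint for feature extraction.
# Assumes an encoder-style checkpoint; the model ID below is an assumed
# name, so verify the exact identifier on the Hugging Face hub.
from transformers import AutoModel, AutoTokenizer

model_id = "EthioNLP/EthioLLM-l-70K"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)

# Encode an Amharic sentence and inspect the contextual embeddings.
inputs = tokenizer("ሰላም ለዓለም", return_tensors="pt")
outputs = model(**inputs)
print(outputs.last_hidden_state.shape)  # (batch, tokens, hidden_size)
```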
Related Works:
Previous research efforts on multilingual language models for low-resource languages.
Downstream Tasks and Datasets:
Evaluation of EthioLLM across downstream NLP tasks, including news classification, machine translation, hate speech detection, sentiment analysis, named entity recognition, and part-of-speech tagging (see the fine-tuning sketch below).
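The sketch below shows what fine-tuning for one of these tasks (binary text classification, e.g., sentiment analysis) could look like with the transformers Trainer API. The two-example inline dataset, labels, checkpoint name, and hyperparameters are all placeholders, not the paper's actual Ethiobenchmark setup.

```python
# Hedged sketch: fine-tuning a pre-trained encoder for text classification.
# The checkpoint name, tiny inline dataset, and hyperparameters are
# placeholders; swap in an Ethiobenchmark split for a real evaluation.
from datasets import Dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

model_id = "EthioNLP/EthioLLM-l-70K"  # assumed checkpoint name
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)

# Placeholder data: two texts with binary sentiment labels.
raw = Dataset.from_dict({"text": ["ጥሩ ፊልም ነው", "መጥፎ ነበር"], "label": [1, 0]})

def tokenize(batch):
    # Pad to a fixed length so the default collator can batch the examples.
    return tokenizer(batch["text"], truncation=True,
                     padding="max_length", max_length=64)

train_set = raw.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ethiollm-cls",
                           num_train_epochs=1,
                           per_device_train_batch_size=2),
    train_dataset=train_set,
)
trainer.train()
```

Per-token tasks such as named entity recognition and part-of-speech tagging follow the same pattern, with AutoModelForTokenClassification in place of the sequence-classification head.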
Results:
Comparison of EthioLLM with state-of-the-art (SOTA) models on these tasks for Amharic, Oromo, Somali, Tigrinya, and Ge'ez.
Stats
Large language models have recently gained popularity due to their strong performance across a wide range of downstream Natural Language Processing (NLP) tasks.
Ethiopia has over 85 spoken languages, but most lack pre-trained models and other NLP resources.
Afro-centric models aim to bridge this gap by focusing on African languages, yet they still cover few Ethiopian languages.