Core Concepts
Amharic LLaMA and LLaVA aim to enhance language models for low-resource languages like Amharic through machine-translation-based data augmentation and added multimodal capabilities.
Abstract:
Large Language Models (LLMs) excel at natural language processing tasks.
LLMs struggle with low-resource languages like Amharic due to limited training data.
LLaMA-2 is trained to understand Amharic, using machine translation for data augmentation and a connected image encoder for multimodal capabilities.
Introduction:
The Transformer architecture revolutionized natural language processing.
LLaMA and LLaVA are open-source models; LLaMA is a text-only LLM, and LLaVA extends it with visual inputs.
Multimodal capabilities are added to the Amharic LLaMA model by connecting an image encoder, following the LLaVA approach (a sketch of the connector follows below).
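To make the multimodal pathway concrete, here is a minimal sketch of a LLaVA-style connector: patch features from a frozen vision encoder are linearly projected into the LLM's embedding space so they can be consumed as soft "image tokens". The class name and dimensions (1024 for a CLIP ViT-L/14 encoder, 4096 for LLaMA-2-7B) are illustrative assumptions, not the paper's exact implementation.

```python
import torch
import torch.nn as nn

class VisionProjector(nn.Module):
    """Maps frozen vision-encoder patch features into the LLM embedding space
    (LLaVA-style); the projected vectors are prepended to the text embeddings.
    Dimensions are assumptions: CLIP ViT-L/14 (1024) -> LLaMA-2-7B (4096)."""

    def __init__(self, vision_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Linear(vision_dim, llm_dim)

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vision_dim)
        return self.proj(patch_features)

# Example: 256 patches of CLIP features for one image become 256 image tokens.
image_tokens = VisionProjector()(torch.randn(1, 256, 1024))
print(image_tokens.shape)  # torch.Size([1, 256, 4096])
```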
Data:
Data is augmented through machine translation of English corpora into Amharic, creating a large and diverse pool of Amharic tokens (see the translation sketch below).
The combined dataset contains 436 million tokens from public Amharic sources plus 3.348 billion machine-translated tokens.
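As a concrete illustration of the augmentation step, the sketch below translates English text into Amharic with Meta's open NLLB model via Hugging Face transformers. The model choice is a stand-in for illustration; the paper's actual translation pipeline may differ.

```python
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# NLLB language codes: English (Latin script) -> Amharic (Ethiopic script).
tokenizer = AutoTokenizer.from_pretrained(
    "facebook/nllb-200-distilled-600M", src_lang="eng_Latn")
model = AutoModelForSeq2SeqLM.from_pretrained("facebook/nllb-200-distilled-600M")

def translate_to_amharic(text: str) -> str:
    inputs = tokenizer(text, return_tensors="pt")
    generated = model.generate(
        **inputs,
        # Force the decoder to start generating in Amharic.
        forced_bos_token_id=tokenizer.convert_tokens_to_ids("amh_Ethi"),
        max_new_tokens=128,
    )
    return tokenizer.batch_decode(generated, skip_special_tokens=True)[0]

print(translate_to_amharic("Large language models excel at many tasks."))
```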
Experiments:
Pretraining and fine-tuning are performed on A100 GPUs.
The authors explore different versions of the dataset and apply visual instruction tuning for the multimodal model (a parameter-efficient fine-tuning sketch follows below).
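Fine-tuning a 7B-parameter model on limited hardware is a natural fit for a parameter-efficient method such as LoRA, which appears among the paper's references. Below is a minimal sketch using the peft library; the base checkpoint and hyperparameters are illustrative assumptions, not the paper's reported configuration.

```python
from peft import LoraConfig, get_peft_model
from transformers import AutoModelForCausalLM

# Base checkpoint and hyperparameters here are illustrative assumptions.
base = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-2-7b-hf")
lora = LoraConfig(
    r=8,                                  # low-rank update dimension
    lora_alpha=16,                        # scaling factor for the LoRA update
    target_modules=["q_proj", "v_proj"],  # attention projections to adapt
    lora_dropout=0.05,
    task_type="CAUSAL_LM",
)
model = get_peft_model(base, lora)
model.print_trainable_parameters()  # only a small fraction of weights train
# `model` can now be passed to a standard transformers Trainer loop.
```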
Results and Evaluation:
Fine-tuning improves performance on both text and visual tasks.
The models perform well across a variety of tasks but struggle with STEM topics.
Conclusion:
Data augmentation and fine-tuning enhance language models for low-resource languages.
The models still exhibit limitations and require further evaluation before production deployment.