Core Concepts
This study documents the development of two open-source compact language models, the TeenyTinyLlama (TTL) pair, tailored for low-resource settings and trained solely on Brazilian Portuguese text.
Abstract
The authors developed two compact language models, the TeenyTinyLlama (TTL) pair, for Brazilian Portuguese text generation. The models were trained from scratch on a dataset of 6.2 billion tokens, including both plain text and instruction-following demonstrations.
Key highlights:
The TTL pair comprises a 160 million and a 460 million parameter model, both designed to run efficiently in low-resource settings.
The authors trained custom SentencePiece tokenizers to encode Brazilian Portuguese more efficiently than the original Llama 2 tokenizer (see the tokenizer sketch after these highlights).
Evaluation on benchmarks like ARC-Challenge, HellaSwag, MMLU, and TruthfulQA shows the TTL models perform competitively with larger models.
Fine-tuning on downstream tasks such as toxicity detection, textual entailment, sentiment analysis, and text classification also demonstrates the models' capabilities (a fine-tuning sketch follows these highlights).
The authors provide detailed information on the training process, including energy consumption and carbon emissions, and release the models under an Apache 2.0 license.
Limitations include the need for more standard benchmarks for low-resource languages and the models' potential to generate hallucinations, biases, and toxic content.
Future work includes scaling the models to 1 billion parameters and expanding the training dataset to 1 trillion tokens.
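To make the tokenizer highlight concrete, here is a minimal sketch of training a SentencePiece tokenizer on a Portuguese corpus. The input file name, vocabulary size, and model type are illustrative assumptions, not the authors' exact configuration.

```python
import sentencepiece as spm

# Train a BPE tokenizer on a plain-text corpus (one sentence per line).
# The file name and 32k vocabulary size are assumptions for illustration,
# not the exact settings used for the TTL models.
spm.SentencePieceTrainer.train(
    input="pt_br_corpus.txt",
    model_prefix="ttl_tokenizer",
    vocab_size=32000,
    model_type="bpe",
    character_coverage=1.0,
)

# Load the trained tokenizer and inspect how a Portuguese sentence is split;
# a tokenizer adapted to the language should need fewer tokens per sentence.
sp = spm.SentencePieceProcessor(model_file="ttl_tokenizer.model")
print(sp.encode("Modelos de linguagem compactos para o português brasileiro.", out_type=str))
```

Fewer tokens per sentence means more text fits in a given context window and less compute is spent per document, which is the encoding-efficiency gain the highlight refers to.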
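The downstream fine-tuning highlight can be illustrated with a short Hugging Face Transformers sketch for binary text classification. The checkpoint id, CSV files, column names, and hyperparameters below are assumptions used only for illustration, not the authors' reported setup.

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Checkpoint id, data files, and hyperparameters are illustrative assumptions.
model_id = "nicholasKluge/TeenyTinyLlama-160m"

tokenizer = AutoTokenizer.from_pretrained(model_id)
tokenizer.pad_token = tokenizer.eos_token  # Llama-style tokenizers ship without a pad token

model = AutoModelForSequenceClassification.from_pretrained(model_id, num_labels=2)
model.config.pad_token_id = tokenizer.pad_token_id

# Expects CSV files with "text" and "label" columns.
dataset = load_dataset("csv", data_files={"train": "train.csv", "test": "test.csv"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

dataset = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="ttl-classifier",
                           num_train_epochs=3,
                           per_device_train_batch_size=16),
    train_dataset=dataset["train"],
    eval_dataset=dataset["test"],
)
trainer.train()
```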
Stats
The 460 million parameter model consumed 115.69 kWh of energy and emitted 41.31 kgCO2eq during training.
The 160 million parameter model consumed 15.5 kWh of energy and emitted 5.7 kgCO2eq during training.
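As an aside on how such figures are typically obtained, the sketch below shows basic usage of the CodeCarbon library to measure energy use and estimated emissions around a training workload. It is a generic illustration, not a claim about the authors' exact instrumentation; the project name and the placeholder workload are assumptions.

```python
from codecarbon import EmissionsTracker

def train():
    # Placeholder standing in for the actual pretraining loop.
    return sum(i * i for i in range(10_000_000))

# Wrap the workload in an EmissionsTracker; by default CodeCarbon also
# writes measurements to an emissions.csv file in the working directory.
tracker = EmissionsTracker(project_name="ttl-pretraining")
tracker.start()
try:
    train()
finally:
    emissions_kg = tracker.stop()  # estimated emissions in kgCO2eq

print(f"Estimated emissions: {emissions_kg:.4f} kgCO2eq")
```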
Quotes
"Large language models have radically changed the field of natural language processing (NLP) with their exceptional ability to perform downstream tasks after being trained on vast amounts of data in a self-supervised learning regime."
"Despite the tremendous success of the field, progress has yet to be made equally regarding all languages."
"This study follows the trend of developing LLMs tailored for low-resource regimes."