This study documents the development of two open-source compact language models, the TeenyTinyLlama (TTL) pair, tailored for low-resource settings and trained solely on Brazilian Portuguese text.
Performance comparable to dedicated from-scratch models can be obtained by further pretraining off-the-shelf multilingual models on the target language, even with a limited compute budget.
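As an illustration of this continued-pretraining recipe, the sketch below further pretrains an off-the-shelf multilingual checkpoint on a monolingual text file with the Hugging Face Trainer; the base model, corpus file, and hyperparameters are placeholders, not the settings used in the cited work.

```python
# Minimal sketch: continued pretraining of a multilingual causal LM on a
# monolingual corpus (placeholder model name, corpus, and hyperparameters).
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer,
                          TrainingArguments)
from datasets import load_dataset

base_model = "bigscience/bloom-560m"   # any available multilingual checkpoint
tokenizer = AutoTokenizer.from_pretrained(base_model)
model = AutoModelForCausalLM.from_pretrained(base_model)

# Placeholder corpus: swap in the target-language text collection.
corpus = load_dataset("text", data_files={"train": "target_language_corpus.txt"})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=1024)

tokenized = corpus["train"].map(tokenize, batched=True, remove_columns=["text"])

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="continued-pretrain",
        per_device_train_batch_size=4,
        gradient_accumulation_steps=8,
        num_train_epochs=1,        # a single pass keeps the compute budget small
        learning_rate=2e-5,
        bf16=True,
    ),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```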
The paper proposes TaCo (Translation-Assisted Cross-Linguality), a method that uses translations in a chain-of-thought process, combined with curriculum learning, to efficiently instruction-tune large language models on new languages, especially low-resource ones.
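The idea can be pictured as formatting each instruction-tuning example so the response walks through translation steps before the final answer. The template below is a hypothetical illustration of such a translation-assisted chain of thought, not the exact prompt format used by TaCo.

```python
# Hypothetical template for a translation-assisted chain-of-thought sample:
# the response first translates the instruction into a high-resource pivot
# language, answers there, then translates the answer back.
def build_taco_style_sample(instruction_lr: str,
                            instruction_en: str,
                            answer_en: str,
                            answer_lr: str,
                            language: str) -> dict:
    """Return one supervised fine-tuning example as a prompt/response pair."""
    response = (
        f"Instruction translated into English: {instruction_en}\n"
        f"Answer in English: {answer_en}\n"
        f"Answer translated back into {language}: {answer_lr}"
    )
    return {"prompt": instruction_lr, "response": response}

# Illustrative example for Nepali (one of the low-resource languages targeted
# by this line of work; the strings here are for demonstration only).
sample = build_taco_style_sample(
    instruction_lr="नेपालको राजधानी कुन हो?",
    instruction_en="What is the capital of Nepal?",
    answer_en="The capital of Nepal is Kathmandu.",
    answer_lr="नेपालको राजधानी काठमाडौँ हो।",
    language="Nepali",
)
```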
Ziya2, a 13-billion-parameter language model, is developed through a data-centric approach that focuses on optimizing the use of pre-training data to enhance the model's capabilities in Chinese, mathematics, and programming tasks, while maintaining or improving its performance on general English benchmarks.
Sailor is a family of open language models ranging from 0.5B to 7B parameters, designed to perform well across South-East Asian languages; the models cover English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.
TeleChat is a suite of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters, developed through extensive pretraining on a diverse corpus and supervised fine-tuning to align with human preferences for conversational AI applications.
HyperCLOVA X is a family of large language models tailored to the Korean language and culture, while also exhibiting strong performance in English, math, and coding.