Core Concepts
Sailor is a family of open language models ranging from 0.5B to 7B parameters, designed to perform well across South-East Asian (SEA) languages such as Vietnamese, Thai, Indonesian, Malay, and Lao, while retaining strong capabilities in English and Chinese.
Abstract
The report presents the development of Sailor, a family of open language models tailored for South-East Asian (SEA) languages. Key insights and techniques used in the development process are discussed:
Data Preprocessing:
Merging adjacent short examples to reconstruct context
Employing document-level code-switching to improve multilingual performance (both techniques are sketched below)
Aggressive data cleaning and deduplication to improve data quality
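The first two items can be pictured with a short Python sketch. This is a minimal illustration assuming plain-text documents; the function names, the 512-character threshold, the separator, and the random pairing strategy are illustrative assumptions, not the report's actual implementation:

```python
import random

def merge_short_examples(docs, min_chars=512, sep="\n\n"):
    """Greedily concatenate adjacent short documents so each training
    example carries more context (threshold is illustrative)."""
    merged, buf = [], ""
    for doc in docs:
        buf = doc if not buf else buf + sep + doc
        if len(buf) >= min_chars:
            merged.append(buf)
            buf = ""
    if buf:  # flush a trailing short remainder
        merged.append(buf)
    return merged

def document_level_code_switch(docs_by_lang, seed=0, sep="\n\n"):
    """Pair whole documents from different languages into single
    training examples (document-level code-switching sketch)."""
    rng = random.Random(seed)
    pool = [d for docs in docs_by_lang.values() for d in docs]
    rng.shuffle(pool)
    # Join consecutive pairs; a production pipeline would control
    # language ratios and sequence lengths rather than pair at random.
    return [sep.join(pool[i:i + 2]) for i in range(0, len(pool), 2)]
```

In practice the merging would respect source boundaries and token counts, and the code-switching step would control language ratios; the sketch only conveys the shape of the two transforms.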
Tokenization:
Utilizing BPE dropout to enhance model robustness against minor prompt variations (see the example below)
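A minimal sketch of BPE dropout using the Hugging Face tokenizers library; the toy corpus, vocabulary size, and 0.1 dropout rate are assumptions for illustration, and the report does not tie Sailor to this particular library:

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# Train a toy BPE tokenizer with dropout enabled: during encoding,
# each learned merge is skipped with probability `dropout`, so the
# same string is segmented differently across calls.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()
trainer = BpeTrainer(vocab_size=200, special_tokens=["[UNK]"])
corpus = ["low lower lowest", "new newer newest"] * 50
tokenizer.train_from_iterator(corpus, trainer)

print(tokenizer.encode("lowest newest").tokens)  # varies call to call
print(tokenizer.encode("lowest newest").tokens)
```

Because the model is trained on many segmentations of the same text, it becomes less brittle to small prompt edits such as a missing trailing space.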
Training:
Tuning the learning rate to balance performance on English and SEA languages
Conducting data mixture simulation experiments to optimize the joint loss across all languages (a minimal simulation sketch follows)
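One way to picture the mixture-simulation idea, in a hedged sketch: sample candidate mixture weights, train a small proxy model on each (stubbed out below as `train_proxy_and_eval`, a hypothetical placeholder), fit a simple regression from mixture weights to per-language loss, and pick the mixture minimizing the predicted joint loss. The Dirichlet sampling, linear model, and equal-weight joint objective are illustrative assumptions, not the report's exact procedure:

```python
import numpy as np

LANGS = ["en", "zh", "vi", "th", "id", "ms", "lo"]
rng = np.random.default_rng(0)
# Synthetic "ground truth" standing in for real proxy-training runs.
_TRUE = rng.uniform(0.5, 2.0, size=(len(LANGS), len(LANGS)))

def train_proxy_and_eval(weights):
    """Hypothetical stub: pretend to train a small proxy model on the
    mixture `weights` and return per-language validation losses."""
    return 2.0 - _TRUE @ weights + 0.02 * rng.standard_normal(len(LANGS))

# 1) Sample candidate mixtures from a Dirichlet prior.
mixtures = rng.dirichlet(np.ones(len(LANGS)), size=64)

# 2) Run (stubbed) proxy training on each candidate mixture.
losses = np.array([train_proxy_and_eval(w) for w in mixtures])

# 3) Fit one linear model per language: loss ~ A @ weights + b.
X = np.hstack([mixtures, np.ones((len(mixtures), 1))])
coef, *_ = np.linalg.lstsq(X, losses, rcond=None)

# 4) Pick the candidate minimizing the predicted joint (summed) loss.
joint = (X @ coef).sum(axis=1)
best = mixtures[joint.argmin()]
print({lang: round(w, 3) for lang, w in zip(LANGS, best)})
```

The payoff of simulating at small scale is that the expensive full-scale run is launched only once, on the mixture the fitted model predicts to be best.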
The report also details the data sources used: high-quality English and Chinese datasets for replay, and SEA-language datasets such as CC100, MADLAD-400, Wikipedia, and OpenSubtitles. The preprocessing pipeline of data normalization, cleaning, and deduplication is described in depth. Finally, the training infrastructure and details are outlined, highlighting the use of the Megatron-LLM and TinyLlama codebases for efficient multi-GPU training.
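As a rough illustration of the cleaning-and-deduplication step, here is a minimal exact-deduplication pass over normalized text. Real pipelines typically add near-duplicate detection (e.g., MinHash), and the normalization rules below are assumptions rather than the report's exact ones:

```python
import hashlib
import re
import unicodedata

def normalize(text):
    """Light normalization before hashing: Unicode NFC, lowercasing,
    collapsed whitespace (illustrative rules only)."""
    text = unicodedata.normalize("NFC", text).lower()
    return re.sub(r"\s+", " ", text).strip()

def exact_dedup(docs):
    """Keep only the first document whose normalized content hashes
    to each digest (removes exact duplicates only)."""
    seen, kept = set(), []
    for doc in docs:
        digest = hashlib.sha1(normalize(doc).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

print(exact_dedup(["Hello  World", "hello world", "Sawasdee"]))
# -> ['Hello  World', 'Sawasdee']
```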
Stats
The training data for Sailor models consists of 140B high-quality tokens for SEA languages and 60B tokens for replay (English and Chinese).
The effective tokens and equivalent epochs (roughly, how many times each source is repeated during training) for each language and data source are provided in Table 3.