Core Concepts
Sailor is a family of open language models ranging from 0.5B to 7B parameters, designed to perform well across South-East Asian (SEA) languages; the models cover English, Chinese, Vietnamese, Thai, Indonesian, Malay, and Lao.
Summary
The report presents the development of Sailor, a family of open language models tailored for South-East Asian (SEA) languages. Key insights and techniques used in the development process are discussed:
- Data Preprocessing:
  - Merging adjacent short examples to reconstruct context (see the merging sketch below)
  - Employing document-level code-switching to improve multilingual performance (see the code-switching sketch below)
  - Aggressive data cleaning and deduplication to improve data quality (see the deduplication sketch below)
- Tokenization:
  - Utilizing BPE dropout to enhance model robustness against minor prompt variations (see the BPE-dropout sketch below)
- Training:
  - Tuning the learning rate to balance performance on English and SEA languages
  - Conducting data mixture simulation experiments to optimize the joint loss across all languages (see the mixture-simulation sketch below)
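The merging step can be illustrated with a short sketch. The 512-character threshold, the separator, and the greedy strategy are illustrative assumptions; the report does not specify its exact merging rule.

```python
# A minimal sketch of merging adjacent short examples, assuming examples
# arrive in their original document order. The threshold, separator, and
# greedy strategy are illustrative assumptions, not the report's recipe.

def merge_short_examples(examples, min_len=512, sep="\n"):
    """Greedily concatenate consecutive short examples until the merged
    text reaches min_len, restoring context lost to over-segmentation."""
    merged, buffer = [], ""
    for text in examples:
        buffer = f"{buffer}{sep}{text}" if buffer else text
        if len(buffer) >= min_len:
            merged.append(buffer)
            buffer = ""
    if buffer:  # flush any trailing fragment
        merged.append(buffer)
    return merged

fragments = ["First short paragraph.", "Second short paragraph.",
             "A long self-contained document. " * 20]
print(merge_short_examples(fragments, min_len=100))
```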
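Document-level code-switching, as opposed to word- or sentence-level switching, packs whole documents from different languages into the same training sequence. The sketch below is an illustration under assumed parameters (two documents per sequence, uniform shuffling); the report's exact packing recipe may differ.

```python
import random

def code_switch_pack(docs_by_lang, docs_per_seq=2, seed=0):
    """Interleave whole documents across languages into training sequences."""
    rng = random.Random(seed)
    pool = [doc for docs in docs_by_lang.values() for doc in docs]
    rng.shuffle(pool)  # mix languages at the document level
    return ["\n".join(pool[i:i + docs_per_seq])
            for i in range(0, len(pool), docs_per_seq)]

docs = {"en": ["An English document."], "vi": ["Tài liệu tiếng Việt."],
        "th": ["เอกสารภาษาไทย"], "id": ["Dokumen bahasa Indonesia."]}
for seq in code_switch_pack(docs):
    print(repr(seq))
```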
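Deduplication can be sketched at the exact-match level as below; the report's pipeline is more aggressive (covering near-duplicates as well), so treat this as only the simplest stage. The normalization rule is an assumption.

```python
import hashlib
import re

def _fingerprint(text):
    # Lowercase and collapse whitespace so trivially different copies collide.
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

def deduplicate(docs):
    """Keep the first occurrence of each exact (normalized) duplicate."""
    seen, unique = set(), []
    for doc in docs:
        digest = _fingerprint(doc)
        if digest not in seen:
            seen.add(digest)
            unique.append(doc)
    return unique

print(deduplicate(["Hello   world", "hello world", "Another document"]))
# -> ['Hello   world', 'Another document']
```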
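BPE dropout randomly skips a fraction of merge operations at encoding time, so the same string can tokenize differently across epochs, which makes the model less sensitive to small prompt perturbations. A minimal demonstration with the Hugging Face `tokenizers` library follows; the toy corpus, vocabulary size, and dropout rate are illustrative, and the report does not necessarily use this library.

```python
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.trainers import BpeTrainer

# dropout=0.1: each BPE merge is skipped with 10% probability at encode time.
tokenizer = Tokenizer(BPE(unk_token="[UNK]", dropout=0.1))
tokenizer.pre_tokenizer = Whitespace()

corpus = ["sailor covers english chinese vietnamese thai indonesian malay lao"] * 100
tokenizer.train_from_iterator(corpus, BpeTrainer(vocab_size=200,
                                                 special_tokens=["[UNK]"]))

# The same input can now yield different segmentations on repeated encodes.
for _ in range(3):
    print(tokenizer.encode("vietnamese indonesian").tokens)
```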
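The mixture-simulation idea can be sketched as a search over candidate mixture weights, scoring each candidate with a predicted per-language loss and keeping the mixture that minimizes the joint loss. Everything below is a stand-in: the power-law loss model, its coefficients, and the uniform joint-loss weighting are assumptions, not the report's fitted simulator.

```python
import itertools
import numpy as np

TOTAL_TOKENS = 200e9  # total pretraining budget (from the report)
LANGS = ["en", "zh", "vi", "th", "id"]
# Hypothetical per-language scaling coefficients: loss(n) = a * n**(-b) + c
COEFFS = {"en": (8.0, 0.08, 1.6), "zh": (9.0, 0.08, 1.8),
          "vi": (10.0, 0.07, 2.0), "th": (11.0, 0.07, 2.1),
          "id": (10.5, 0.07, 2.0)}

def predicted_loss(lang, tokens):
    a, b, c = COEFFS[lang]
    return a * tokens ** -b + c

def joint_loss(weights):
    # Uniform weighting over languages; the report may weight differently.
    return float(np.mean([predicted_loss(lang, w * TOTAL_TOKENS)
                          for lang, w in zip(LANGS, weights)]))

# Enumerate coarse mixtures on a simplex grid and keep the best one.
best_loss, best_mix = float("inf"), None
for combo in itertools.product(range(1, 7), repeat=len(LANGS)):
    weights = np.array(combo) / sum(combo)
    loss = joint_loss(weights)
    if loss < best_loss:
        best_loss, best_mix = loss, weights

print(f"best joint loss: {best_loss:.3f}")
print("best mixture:", dict(zip(LANGS, best_mix.round(3))))
```

The point of such a simulation is to pick the mixture cheaply before committing the full token budget to a single training run.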
The report also provides details on the data sources used, including high-quality English and Chinese datasets for replay, as well as datasets for SEA languages such as CC100, MADLAD-400, Wikipedia, and OpenSubtitles. The preprocessing pipeline, including data normalization, cleaning, and deduplication, is thoroughly described. Finally, the training infrastructure and details are outlined, highlighting the use of the Megatron-LLM and TinyLlama codebases for efficient multi-GPU training.
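As a companion to the deduplication sketch above, the normalization and cleaning stages of such a pipeline can be illustrated as below. The specific heuristics (Unicode NFKC normalization, a minimum word count, a maximum symbol ratio) are common web-data filters chosen for illustration, not the report's exact rules.

```python
import re
import unicodedata

def normalize(text):
    text = unicodedata.normalize("NFKC", text)   # unify Unicode variants
    return re.sub(r"[ \t]+", " ", text).strip()  # collapse runs of spaces

def keep(text, min_words=5, max_symbol_ratio=0.3):
    """Drop documents that are too short or dominated by symbols."""
    words = text.split()
    if len(words) < min_words:
        return False
    symbols = sum(not (ch.isalnum() or ch.isspace()) for ch in text)
    return symbols / max(len(text), 1) <= max_symbol_ratio

raw_docs = ["Too short.", "A normal sentence that should survive cleaning.",
            "#### @@@@ !!!! spam-like symbol soup ####"]
cleaned = [normalize(d) for d in raw_docs if keep(normalize(d))]
print(cleaned)  # only the well-formed sentence remains
```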
Statistics
The training data for the Sailor models comprises roughly 200B tokens in total: 140B high-quality tokens in SEA languages and 60B replay tokens (English and Chinese), so replay accounts for about 30% of the mixture.
The effective tokens and equivalent epochs for each language and data source are provided in Table 3.