The report presents the development of Sailor, a family of open language models tailored for South-East Asian (SEA) languages. Key insights and techniques from the development process are discussed across three areas: data preprocessing, tokenization, and training.
The report also provides details on the data sources used, including high-quality English and Chinese datasets for replay, as well as datasets for SEA languages such as CC100, MADLAD-400, Wikipedia, and OpenSubtitles. The preprocessing pipeline, including data normalization, cleaning, and deduplication, is thoroughly described. Finally, the training infrastructure and details are outlined, highlighting the use of Megatron-LLM and TinyLlama codebases for efficient multi-GPU training.
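To make the preprocessing stage concrete, below is a minimal sketch of two of the steps named above, normalization and exact deduplication. This is an illustrative stand-in, not the Sailor pipeline itself: the function names and the choice of NFKC normalization plus SHA-256 hashing are assumptions, and the actual report may use fuzzy (e.g. MinHash-based) deduplication rather than exact matching.

```python
import hashlib
import unicodedata

def normalize(text: str) -> str:
    """Illustrative cleaning step: Unicode-normalize and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())

def deduplicate(docs):
    """Drop exact duplicates by hashing the normalized text.

    A stand-in for the deduplication stage described in the report;
    keeps the first occurrence of each distinct document.
    """
    seen, unique = set(), []
    for doc in docs:
        key = hashlib.sha256(normalize(doc).encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(doc)
    return unique

# Documents that differ only in whitespace collapse to one entry.
docs = ["Halo  dunia!", "Halo dunia!", "Xin chào"]
print(deduplicate(docs))
```

Hashing normalized text rather than raw text means near-identical crawls that differ only in spacing or encoding are treated as duplicates, which is the usual motivation for normalizing before deduplicating.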
Key ideas extracted from the paper by Longxu Dou, Q... at arxiv.org, 04-05-2024
https://arxiv.org/pdf/2404.03608.pdf