The report presents the development of Sailor, a family of open language models tailored for South-East Asian (SEA) languages. Key insights and techniques from the development process are discussed, covering data preprocessing, tokenization, and training.
The report also provides details on the data sources used, including high-quality English and Chinese datasets for replay, as well as datasets for SEA languages such as CC100, MADLAD-400, Wikipedia, and OpenSubtitles. The preprocessing pipeline, including data normalization, cleaning, and deduplication, is thoroughly described. Finally, the training infrastructure and details are outlined, highlighting the use of Megatron-LLM and TinyLlama codebases for efficient multi-GPU training.
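The normalization, cleaning, and deduplication steps mentioned above can be sketched roughly as follows. This is a minimal illustrative pipeline, not Sailor's actual code: the helper names (`normalize`, `is_clean`, `deduplicate`) and the specific filter thresholds are assumptions, and the exact-hash deduplication stands in for the more sophisticated near-duplicate detection (e.g. MinHash LSH) that production pipelines typically use.

```python
import hashlib
import unicodedata


def normalize(text: str) -> str:
    # Unicode normalization plus whitespace collapsing (illustrative rules).
    text = unicodedata.normalize("NFKC", text)
    return " ".join(text.split())


def is_clean(text: str) -> bool:
    # Simple heuristic quality filters; real pipelines apply many more rules
    # (language ID, perplexity filtering, boilerplate detection, etc.).
    if len(text) < 20:
        return False
    alpha = sum(ch.isalpha() for ch in text)
    return alpha / len(text) > 0.5


def deduplicate(docs):
    # Exact deduplication via content hashing after normalization.
    seen, out = set(), []
    for doc in docs:
        doc = normalize(doc)
        if not is_clean(doc):
            continue
        digest = hashlib.sha256(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            out.append(doc)
    return out


corpus = [
    "Selamat pagi, dunia!  Ini contoh dokumen.",
    "Selamat pagi, dunia! Ini contoh dokumen.",  # duplicate after normalization
    "short",                                     # dropped by the length filter
]
print(deduplicate(corpus))  # only one document survives
```

Normalizing before hashing matters: the first two documents differ only in whitespace, so without the `normalize` step the exact-hash check would miss the duplicate.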
Key insights extracted from https://arxiv.org/pdf/2404.03608.pdf, by Longxu Dou, Q... at arxiv.org, 04-05-2024.