Core Concepts
The Yi model family from 01.AI shows that advanced capabilities in large language models follow largely from high-quality data engineering and continual pretraining, yielding strong performance across a wide range of benchmarks.
Summary
01.AI introduces the Yi model family, a series of language and multimodal models with advanced capabilities. Starting from pretrained base language models, the family is extended to chat models, long-context models, depth-upscaled models, and vision-language models. The performance of the Yi models is attributed primarily to high-quality data resulting from extensive data engineering. For pretraining, a corpus of 3.1 trillion English and Chinese tokens is constructed with a sophisticated data-cleaning pipeline. For finetuning, a small instruction dataset of fewer than 10K examples is meticulously polished over multiple iterations. The vision-language model combines the chat language model with a vision transformer encoder and trains the system to align visual representations with the semantic space of the language model. Lightweight continual pretraining extends the context length to 200K tokens and yields strong retrieval performance. Increasing the depth of a pretrained checkpoint and continuing pretraining further improves performance.
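The depth-upscaling idea is easiest to see in code: take a pretrained stack of decoder layers, duplicate a contiguous block, and continue pretraining the deeper model. The sketch below is a minimal illustration under assumptions; the helper name depth_upscale, the toy layer type, and the specific indices duplicated are hypothetical and not the exact recipe used for Yi.

```python
import copy
import torch.nn as nn

def depth_upscale(layers: nn.ModuleList, dup_start: int, dup_end: int) -> nn.ModuleList:
    """Duplicate a contiguous block of decoder layers from a pretrained stack.
    The deeper stack is then refined with continual pretraining.
    Which layers to duplicate is an illustrative assumption, not Yi's recipe."""
    duplicated = [copy.deepcopy(layers[i]) for i in range(dup_start, dup_end)]
    # Insert the copies right after the block they came from, so the
    # pretrained weights are reused verbatim and only the depth changes.
    return nn.ModuleList(list(layers[:dup_end]) + duplicated + list(layers[dup_end:]))

# Toy example: a 32-layer stack upscaled to 48 layers by repeating layers 8-24.
base = nn.ModuleList(nn.TransformerEncoderLayer(d_model=64, nhead=4) for _ in range(32))
deeper = depth_upscale(base, dup_start=8, dup_end=24)
print(len(base), "->", len(deeper))  # 32 -> 48
```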
Key Statistics
The depth of pretrained checkpoints is increased, and the deeper model is refined through continual pretraining.
Pretraining uses 3.1 trillion tokens of English and Chinese text.
The finetuning dataset consists of fewer than 10K instructions, polished over multiple iterations.
The vision-language model aligns visual representations from a vision transformer encoder with the semantic space of the language model (see the sketch after this list).
Context length is extended to 200K tokens through lightweight continual pretraining.
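To make the vision-language alignment item above concrete, the sketch below shows one common way such alignment is wired: an MLP connector projects vision-transformer patch features into the language model's embedding dimension so that image tokens and text embeddings share one sequence. The class name VisionLanguageConnector, the dimensions, and the two-layer MLP design are illustrative assumptions, not the exact Yi-VL architecture.

```python
import torch
import torch.nn as nn

class VisionLanguageConnector(nn.Module):
    """Projects ViT patch features into the LLM embedding space.
    Dimensions and the two-layer MLP design are illustrative assumptions."""
    def __init__(self, vit_dim: int = 1024, llm_dim: int = 4096):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(vit_dim, llm_dim),
            nn.GELU(),
            nn.Linear(llm_dim, llm_dim),
        )

    def forward(self, patch_features: torch.Tensor) -> torch.Tensor:
        # patch_features: (batch, num_patches, vit_dim) from the ViT encoder
        return self.proj(patch_features)  # (batch, num_patches, llm_dim)

# Usage: projected image tokens are concatenated with text embeddings
# before the language model runs, so both modalities share one sequence.
connector = VisionLanguageConnector()
image_tokens = connector(torch.randn(1, 256, 1024))  # dummy ViT features
text_embeds = torch.randn(1, 32, 4096)               # dummy text embeddings
inputs_embeds = torch.cat([image_tokens, text_embeds], dim=1)
```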