Core Concepts
Ziya2 is a 13-billion-parameter language model developed through a data-centric approach: pre-training data is carefully curated and staged to strengthen the model's capabilities in Chinese, mathematics, and programming while maintaining or improving its performance on general English benchmarks.
Abstract
The paper presents the development of Ziya2, a 13-billion-parameter large language model (LLM) that builds upon the open-source LLaMA2 model. The key highlights are:
Data Processing Pipeline:
The authors propose a comprehensive data processing pipeline that includes data preprocessing, automatic scoring, rule-based filtering, content deduplication, and data evaluation.
This pipeline is used to clean and curate a high-quality pre-training dataset of roughly 700 billion tokens, covering English, Chinese, and multilingual data.
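To make the five pipeline stages concrete, here is a minimal sketch of how such a cleaning pass might be structured. Every heuristic, threshold, and function name below is an invented stand-in for illustration, not the paper's actual implementation.

```python
# Illustrative five-stage cleaning pass mirroring the pipeline's shape
# (preprocess -> score -> rule filter -> dedup -> evaluate); all heuristics
# and thresholds here are assumptions, not the paper's code.
import hashlib
import re

def preprocess(doc: str) -> str:
    # Strip control characters and collapse whitespace.
    doc = re.sub(r"[\x00-\x08\x0b-\x1f]", "", doc)
    return re.sub(r"\s+", " ", doc).strip()

def quality_score(doc: str) -> float:
    # Toy quality proxy: fraction of alphabetic/CJK characters.
    # A production pipeline would use a trained scorer instead.
    return sum(c.isalpha() for c in doc) / max(len(doc), 1)

def passes_rules(doc: str) -> bool:
    # Rule-based filtering: drop very short or low-quality documents.
    return len(doc) >= 200 and quality_score(doc) >= 0.5

def content_key(doc: str) -> str:
    # Exact deduplication via content hashing; a corpus at this scale
    # would also need fuzzy (near-duplicate) detection.
    return hashlib.md5(doc.encode("utf-8")).hexdigest()

def clean_corpus(raw_docs):
    seen = set()
    for raw in raw_docs:
        doc = preprocess(raw)
        if not passes_rules(doc):
            continue
        key = content_key(doc)
        if key in seen:
            continue
        seen.add(key)
        yield doc
    # A final "data evaluation" stage would audit keep/drop statistics
    # and sample retained documents for manual review.
```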
Continual Pre-training Strategy:
The authors adopt a three-stage continual pre-training strategy, where the first stage uses unsupervised data, the second stage incorporates supervised datasets, and the third stage focuses on improving mathematical abilities.
This strategy aims to enhance the model's capabilities in Chinese, mathematics, and programming, while maintaining or improving its performance on general English benchmarks.
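A minimal sketch of how such a staged schedule could be wired up is shown below; the stage mixtures and token budgets are placeholder assumptions for illustration, not the paper's exact recipe.

```python
# Hypothetical three-stage continual pre-training schedule; mixtures and
# token budgets are placeholders, not the paper's actual configuration.
from dataclasses import dataclass

@dataclass
class Stage:
    name: str
    sources: list[str]   # corpora sampled during this stage
    token_budget: float  # tokens to consume before advancing

STAGES = [
    Stage("stage1_unsupervised", ["pile_pajama", "cc", "wudao", "yuan1.0", "code"], 6.5e11),
    Stage("stage2_with_supervised", ["stage1_mixture", "instruct", "wanjuan"], 3e10),
    Stage("stage3_math", ["stage2_mixture", "metamath"], 5e9),
]

def continual_pretrain(train_step, sample_batch):
    # train_step(batch) and sample_batch(sources) are stand-ins for the
    # real optimizer step and mixture-aware data sampler.
    for stage in STAGES:
        consumed = 0
        while consumed < stage.token_budget:
            batch = sample_batch(stage.sources)  # list of token sequences
            train_step(batch)
            consumed += sum(len(seq) for seq in batch)
```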
Model Improvements:
The authors make several structural improvements to the LLaMA2 architecture, including the tokenizer, positional embedding, layer normalization, and attention mechanisms, to better adapt to the diverse data distribution and improve training efficiency.
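For reference, below are standard PyTorch formulations of two of the components named above: rotary position embedding (which LLaMA2 already uses) and layer normalization computed in float32. Both are sketches of the general techniques, not Ziya2's code, and treating "improved layer normalization" as float32 computation is an assumption on our part.

```python
# Standard formulations of two architectural components, as sketches only.
import torch

def rotary_embed(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    # x: (seq_len, dim) with even dim. Rotates channel pairs by a
    # position-dependent angle so attention becomes relative-position aware.
    seq_len, dim = x.shape
    half = dim // 2
    freqs = base ** (-torch.arange(half, dtype=torch.float32) / half)
    angles = torch.arange(seq_len, dtype=torch.float32)[:, None] * freqs[None, :]
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, :half], x[:, half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)

def layernorm_fp32(x, weight, bias, eps=1e-5):
    # Normalize in float32 for numerical stability, then cast back to the
    # input dtype (useful when training in bf16/fp16).
    orig_dtype = x.dtype
    x = x.float()
    mean = x.mean(-1, keepdim=True)
    var = x.var(-1, keepdim=True, unbiased=False)
    out = (x - mean) / (var + eps).sqrt() * weight.float() + bias.float()
    return out.to(orig_dtype)
```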
Benchmark Evaluation:
Ziya2 is evaluated on six representative benchmarks: MMLU, CMMLU, C-Eval, GSM8K, MATH, and HumanEval.
The results show that Ziya2 significantly outperforms LLaMA2 and other open-source models of comparable size, especially in Chinese, mathematical, and programming tasks.
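Scoring on the multiple-choice benchmarks reduces to exact-match accuracy over held-out questions; the snippet below is a generic sketch of that pattern, not the paper's evaluation harness. Generation benchmarks such as GSM8K and HumanEval additionally require answer extraction or code execution.

```python
# Generic multiple-choice scoring (the MMLU/CMMLU/C-Eval pattern);
# a common evaluation idiom, not the paper's harness.
def accuracy(predict, examples):
    # predict(question, choices) -> option letter such as "B"
    hits = sum(predict(ex["question"], ex["choices"]) == ex["answer"]
               for ex in examples)
    return hits / len(examples)

# Toy usage with a stand-in model:
examples = [{"question": "2+2=?", "choices": {"A": "3", "B": "4"}, "answer": "B"}]
print(accuracy(lambda q, c: "B", examples))  # 1.0
```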
Data-centric Scaling Laws:
The authors define three data attributes (Coherence, Readability, and Similarity) and establish data-centric scaling laws to illustrate the impact of different data characteristics on LLM performance.
The findings suggest that improving the semantic and grammatical quality of pre-training data is more effective in enhancing model performance than data augmentation.
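The paper's fitted forms are not reproduced here, but the sketch below shows how one might fit a Chinchilla-style curve L(D) = E + A/D^α separately to data buckets stratified by a quality attribute such as Coherence, then compare the fitted constants. The functional form and all numbers are illustrative assumptions.

```python
# Hedged sketch of a data-centric scaling-law fit; synthetic data throughout.
import numpy as np
from scipy.optimize import curve_fit

def scaling_law(D, E, A, alpha):
    return E + A / D**alpha

D = np.array([1.0, 5.0, 20.0, 100.0])  # training tokens, in billions (synthetic)
loss_by_bucket = {
    "high-coherence": np.array([3.1, 2.7, 2.4, 2.2]),  # synthetic losses
    "low-coherence":  np.array([3.4, 3.1, 2.9, 2.8]),
}

for bucket, loss in loss_by_bucket.items():
    (E, A, alpha), _ = curve_fit(scaling_law, D, loss, p0=[2.0, 1.0, 0.5])
    print(f"{bucket}: E={E:.2f} A={A:.2f} alpha={alpha:.2f}")
# A lower irreducible term E (or steeper alpha) for the high-quality bucket
# would indicate that data quality, not just quantity, drives the gains.
```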
Overall, the Ziya2 model demonstrates the effectiveness of the proposed data-centric approach in developing large language models with enhanced capabilities across multiple domains.
Stats
The pre-training dataset for Ziya2 totals roughly 700 billion tokens, covering English, Chinese, and multilingual data.
The dataset includes Pile-Pajama (110B tokens), CC (109B tokens), Wudao (48B tokens), Yuan1.0 (193B tokens), Translate (1.5B tokens), Code (191B tokens), Instruct (0.8B tokens), Wanjuan (29B tokens), and MetaMath (0.1B tokens).
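A quick arithmetic check of these component sizes and their shares of the mixture (they sum to about 682 billion tokens, in line with the roughly-700-billion total):

```python
# Component sizes in billions of tokens, as reported above.
components = {
    "Pile-Pajama": 110, "CC": 109, "Wudao": 48, "Yuan1.0": 193,
    "Translate": 1.5, "Code": 191, "Instruct": 0.8, "Wanjuan": 29,
    "MetaMath": 0.1,
}
total = sum(components.values())
print(f"total: {total:.1f}B tokens")  # 682.4B
for name, size in sorted(components.items(), key=lambda kv: -kv[1]):
    print(f"{name:>12}: {size:6.1f}B ({size / total:5.1%})")
```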
Quotes
"Ziya2 significantly outperforms LLaMA2 on all the benchmarks. Specifically, for general English tasks, Ziya2 outperforms LLaMA2 by 6 points on MMLU. For general Chinese tasks, Ziya2 surpasses LLaMA2 by 23 and 24 points on CMMLU and C-Eval, respectively. For specific downstream tasks, Ziya2 outperforms LLaMA2 by 40, 6, and 13 points on GSM8K, MATH, and HumanEval datasets, respectively."
"The results highlight the effectiveness of our continual pre-training strategy. It not only enhances LLaMA2's English capabilities and mitigates catastrophic forgetting but also significantly improves its performance in Chinese, mathematical, and code programming tasks."