Ziya2, a 13-billion-parameter language model, is developed with a data-centric approach that optimizes the use of pre-training data to strengthen the model's capabilities on Chinese, mathematics, and programming tasks, while maintaining or improving its performance on general English benchmarks.
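The summary keeps the data pipeline abstract; purely as a hedged illustration of what score-based, data-centric filtering can look like, here is a minimal Python sketch. The heuristics, function names, and threshold are hypothetical placeholders, not Ziya2's actual cleaning rules:

```python
import re

def quality_score(doc: str) -> float:
    """Toy quality heuristic: penalize very short documents and
    text with long repeated spans. Real pipelines use far richer
    signals (language ID, perplexity, dedup at corpus scale)."""
    if len(doc) < 200:
        return 0.0
    alpha_ratio = sum(c.isalpha() for c in doc) / len(doc)
    # Crude repeated-substring check as a stand-in for deduplication.
    dedup_penalty = 0.5 if re.search(r"(.{20,})\1", doc) else 0.0
    return max(0.0, alpha_ratio - dedup_penalty)

def filter_corpus(docs: list[str], threshold: float = 0.4) -> list[str]:
    """Keep only documents whose score clears the threshold."""
    return [d for d in docs if quality_score(d) >= threshold]
```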
The LLM-ADE framework introduces a novel approach to continual pre-training of large language models, enabling efficient integration of new datasets while mitigating catastrophic forgetting and double descent.
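The summary does not detail LLM-ADE's internal mechanism, so the sketch below shows a different, generic technique for the same problem: replay-style data mixing, a standard remedy for catastrophic forgetting in continual pre-training. It should not be read as LLM-ADE's actual method, and all names are illustrative:

```python
import random

def mixed_batches(new_data, replay_data, replay_frac=0.25, batch_size=8):
    """Yield training batches that interleave a fraction of previously
    seen data (replay) with the new corpus, so the model keeps seeing
    old-distribution examples while adapting to the new dataset."""
    new_data = list(new_data)  # copy so the caller's list is untouched
    while new_data:
        batch = []
        for _ in range(batch_size):
            if replay_data and random.random() < replay_frac:
                batch.append(random.choice(replay_data))
            elif new_data:
                batch.append(new_data.pop())
        yield batch
```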
Continual pre-training of the Llama-3 language model with an optimal mixture ratio of additional Chinese corpora can substantially improve its performance on Chinese-language tasks, as well as domain-specific capabilities such as math, coding, and emotional intelligence.
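As a minimal sketch of what sampling at a fixed mixture ratio means in practice, the snippet below interleaves two corpora probabilistically. The 0.3 ratio and all names are placeholders, not the paper's reported optimum:

```python
import random

def sample_mixture(base_docs, chinese_docs, chinese_ratio=0.3, n=10):
    """Draw a pre-training stream in which roughly `chinese_ratio` of
    the documents come from the additional Chinese corpus and the rest
    from the original (base) corpus."""
    stream = []
    for _ in range(n):
        pool = chinese_docs if random.random() < chinese_ratio else base_docs
        stream.append(random.choice(pool))
    return stream
```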