Core Concepts
This study introduces CT-LLM, a 2B-parameter large language model that prioritizes the Chinese language in its pretraining and fine-tuning, demonstrating exceptional performance on Chinese language tasks and competitive abilities in English.
Summary
This study presents the development of CT-LLM, a 2 billion parameter large language model (LLM) that is uniquely focused on the Chinese language. The key highlights are:
Pretraining Data: CT-LLM was pretrained on a corpus of roughly 1,200 billion tokens: about 800 billion Chinese, 300 billion English, and 100 billion code tokens (exact counts appear under Statistics below). This Chinese-dominant composition underpins the model's exceptional proficiency in understanding and processing Chinese.
Model Architecture: The architecture is a transformer decoder incorporating multi-head attention, rotary positional embeddings (RoPE), and SwiGLU activations (a minimal sketch of the latter two appears after this list).
Supervised Fine-Tuning (SFT): The model was further refined through SFT on both Chinese and English data, which strengthened its capabilities in both languages (see the SFT loss sketch after this list).
Preference Alignment: The model was optimized for harmlessness and helpfulness using Direct Preference Optimization (DPO) on human preference datasets (the DPO objective is sketched after this list).
Evaluation: CT-LLM was extensively evaluated on a range of benchmarks, including MMLU, C-Eval, and CMMLU. The results demonstrate the model's balanced proficiency across diverse domains, with particular strengths in Chinese language understanding and reasoning.
Chinese Hard Cases Benchmark (CHC-Bench): The authors developed a multidisciplinary Chinese benchmark to assess the model's instruction understanding and following abilities, where CT-LLM exhibited strong performance.
Political Bias Analysis: The study also examined the political biases exhibited by CT-LLM, finding that it occupies a distinct quadrant on the political spectrum compared to models trained on more Western-centric data.
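To make the architectural terms above concrete, here is a minimal PyTorch sketch of rotary positional embeddings and the SwiGLU feed-forward, assuming the standard formulations used in LLaMA-style decoders; the shapes, frequency base, and helper names are illustrative, not taken from the CT-LLM paper.

```python
import torch
import torch.nn.functional as F

def swiglu(x, W, V, W2):
    # SwiGLU feed-forward: (SiLU(xW) * xV) W2, the gated activation
    # used in LLaMA-style decoders.
    return (F.silu(x @ W) * (x @ V)) @ W2

def rotary_embed(x, base=10000.0):
    # Rotary positional embeddings (RoPE): rotate each consecutive
    # feature pair of x (shape: seq_len x dim, dim even) by a
    # position-dependent angle, so attention scores become functions
    # of relative position.
    seq_len, dim = x.shape
    pos = torch.arange(seq_len, dtype=torch.float32).unsqueeze(1)
    freqs = base ** (-torch.arange(0, dim, 2, dtype=torch.float32) / dim)
    angles = pos * freqs                      # (seq_len, dim // 2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[:, 0::2], x[:, 1::2]
    out = torch.empty_like(x)
    out[:, 0::2] = x1 * cos - x2 * sin
    out[:, 1::2] = x1 * sin + x2 * cos
    return out

# Example: a 16-token sequence with hidden size 64.
x = torch.randn(16, 64)
q = rotary_embed(x)                           # queries with RoPE applied
W, V, W2 = torch.randn(64, 256), torch.randn(64, 256), torch.randn(256, 64)
h = swiglu(x, W, V, W2)                       # feed-forward output
```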
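The SFT stage typically optimizes next-token cross-entropy only on response tokens, with the instruction masked out. The sketch below assumes that standard setup; the masking convention and ignore_index value are conventional PyTorch practice, not details confirmed by the paper.

```python
import torch
import torch.nn.functional as F

def sft_loss(logits, labels, prompt_len):
    # Next-token cross-entropy computed only on the response tokens,
    # with the instruction (prompt) positions masked out.
    labels = labels.clone()
    labels[:prompt_len] = -100          # ignore_index masks prompt positions
    return F.cross_entropy(logits[:-1], labels[1:], ignore_index=-100)

# Example: a 10-token sequence (6 prompt + 4 response), 32-word vocab.
vocab, seq = 32, 10
logits = torch.randn(seq, vocab)
labels = torch.randint(0, vocab, (seq,))
loss = sft_loss(logits, labels, prompt_len=6)
```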
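Direct Preference Optimization trains the policy to prefer chosen over rejected responses relative to a frozen reference model, without a separate reward model. This sketch implements the standard DPO loss from Rafailov et al. (2023); the beta value and function signature are illustrative assumptions, not CT-LLM's reported settings.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Implicit rewards: how far the policy has moved from the frozen
    # reference model on each response (summed log-probabilities).
    chosen_rewards = policy_chosen_logps - ref_chosen_logps
    rejected_rewards = policy_rejected_logps - ref_rejected_logps
    # Maximize the margin between chosen and rejected implicit rewards;
    # beta scales how strongly the policy may deviate from the reference.
    logits = beta * (chosen_rewards - rejected_rewards)
    return -F.logsigmoid(logits).mean()

# Example with dummy summed log-probabilities for 4 preference pairs.
lp = lambda: torch.randn(4)
loss = dpo_loss(lp(), lp(), lp(), lp())
```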
Overall, this research challenges the prevailing paradigm of training LLMs predominantly on English corpora and then adapting them to other languages. By prioritizing the Chinese language in the model's development, CT-LLM offers a new direction for LLM training methodologies, promoting more inclusive and versatile language models.
Statistics
The pretraining dataset comprises 1,254.68 billion tokens: 840.48 billion Chinese, 314.88 billion English, and 99.3 billion code tokens (a share breakdown is sketched after this list).
The supervised fine-tuning dataset comprises 105K Chinese instruction pairs, combined with English data at varying ratios.
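As a quick consistency check, the sketch below derives each component's share of the pretraining corpus; the token counts are the paper's, while the totals and percentages are computed here.

```python
# Reported token counts (billions) for the CT-LLM pretraining corpus.
corpus = {"Chinese": 840.48, "English": 314.88, "Code": 99.30}
total = sum(corpus.values())   # 1,254.66B; the paper reports 1,254.68B,
                               # so the components appear to be rounded.
for source, tokens in corpus.items():
    print(f"{source:8s} {tokens:8.2f}B  {100 * tokens / total:5.1f}%")
# Chinese ~67.0%, English ~25.1%, Code ~7.9% of the corpus.
```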
Quotes
"This strategic composition facilitates the model's exceptional proficiency in understanding and processing Chinese, a capability further enhanced through alignment techniques."
"By challenging the prevailing norms of training LLMs primarily on English corpora, CT-LLM expands the horizons of language model training, offering fresh perspectives on the potentialities of non-English-centric LLMs."