Core Concepts
TeleChat is a suite of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters, developed through extensive pretraining on a diverse corpus and supervised fine-tuning to align with human preferences for conversational AI applications.
Abstract
The technical report presents the development of TeleChat, a collection of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. The models are first pretrained on a large and diverse corpus of English and Chinese texts, totaling trillions of tokens. The pretraining process involves careful data cleaning, including rule-based filtering, deduplication, high-quality data selection, and security processing.
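As a rough illustration of the kind of rule-based filtering and deduplication the report describes, the sketch below uses made-up thresholds and a simple hash-based exact-dedup pass; it is an assumption-laden example, not TeleChat's actual cleaning pipeline.

```python
import hashlib
import re

def rule_filter(doc: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Keep documents that pass simple length and symbol-ratio rules (illustrative thresholds)."""
    if len(doc) < min_chars:
        return False
    symbols = sum(1 for c in doc if not (c.isalnum() or c.isspace()))
    return symbols / max(len(doc), 1) <= max_symbol_ratio

def dedup_key(doc: str) -> str:
    """Hash a whitespace-normalized, lowercased version of the text for exact deduplication."""
    normalized = re.sub(r"\s+", " ", doc).strip().lower()
    return hashlib.md5(normalized.encode("utf-8")).hexdigest()

def clean_corpus(docs):
    """Yield documents that survive rule-based filtering and exact deduplication."""
    seen = set()
    for doc in docs:
        if not rule_filter(doc):
            continue                     # rule-based filtering
        key = dedup_key(doc)
        if key in seen:
            continue                     # deduplication
        seen.add(key)
        yield doc                        # survivors would go on to quality selection and security checks
```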
The report then describes the supervised fine-tuning stage, where the models are further trained on a large dataset of human-annotated prompts and responses to align them with human preferences for conversational AI. The fine-tuning methodology includes data organization, the use of noisy embedding fine-tuning, and multi-stage long-context training to expand the models' context window to 96k tokens.
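The noisy embedding fine-tuning mentioned here follows the NEFTune idea of perturbing token embeddings during training. The sketch below shows the standard recipe (uniform noise scaled by alpha / sqrt(seq_len * dim)) with an illustrative alpha value; it is not necessarily TeleChat's exact configuration.

```python
import torch

def neftune_embed(embed_layer: torch.nn.Embedding, input_ids: torch.Tensor,
                  alpha: float = 5.0) -> torch.Tensor:
    """Embed tokens, then add uniform noise scaled by alpha / sqrt(seq_len * dim).

    Noise is only added in training mode; at inference the embeddings are unchanged.
    The alpha value here is an illustrative default, not TeleChat's reported setting.
    """
    embeds = embed_layer(input_ids)                      # (batch, seq_len, dim)
    if embed_layer.training:
        seq_len, dim = embeds.shape[1], embeds.shape[2]
        scale = alpha / (seq_len * dim) ** 0.5
        noise = torch.empty_like(embeds).uniform_(-1.0, 1.0) * scale
        embeds = embeds + noise
    return embeds
```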
Additionally, the report discusses the reinforcement learning approach used to further align the models with human preferences and safety norms. Engineering details, including the hardware setup and the parallel computing techniques employed, are also provided.
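As a back-of-the-envelope illustration of how data, tensor, and pipeline parallelism multiply out to a total GPU count, the numbers below are hypothetical; the report's actual cluster configuration is not reproduced here.

```python
# Hypothetical parallelism degrees for illustration only.
tensor_parallel = 8      # GPUs that split each layer's weight matrices
pipeline_parallel = 4    # consecutive GPU stages, each holding a slice of the layers
data_parallel = 12       # model replicas, each processing a different data shard

gpus_needed = tensor_parallel * pipeline_parallel * data_parallel
print(f"GPUs required: {gpus_needed}")   # 8 * 4 * 12 = 384
```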
The performance of TeleChat is evaluated on a wide range of benchmarks, including examination tests, language understanding, reasoning, and coding tasks. The results demonstrate that TeleChat outperforms other open-source models of similar size across various domains. To support future research and applications, the authors release the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with a portion of the pretraining data.
Finally, the report presents a method for alleviating hallucination in TeleChat by incorporating knowledge graphs, which significantly improves the model's ability to provide accurate answers to factual questions.
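A minimal sketch of the knowledge-graph-grounding idea follows, assuming a naive word-overlap retriever and a hypothetical prompt format; the report's actual retrieval and prompting pipeline may differ. The example facts are taken from this summary.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]  # (subject, relation, object)

def retrieve_triples(question: str, kg: List[Triple], top_k: int = 3) -> List[Triple]:
    """Naive retrieval: rank triples by word overlap with the question."""
    q_words = set(question.lower().split())
    scored = sorted(kg, key=lambda t: -len(q_words & set(" ".join(t).lower().split())))
    return scored[:top_k]

def build_grounded_prompt(question: str, kg: List[Triple]) -> str:
    """Prepend retrieved facts so the model answers from them rather than from memory alone."""
    facts = "\n".join(f"- {s} {r} {o}" for s, r, o in retrieve_triples(question, kg))
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer:"

kg = [
    ("TeleChat", "has variants with", "3B, 7B and 12B parameters"),
    ("TeleChat", "supports a context window of", "96k tokens"),
]
print(build_grounded_prompt("How many parameters does TeleChat have?", kg))
```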
Stats
TeleChat's pretraining corpus is a diverse collection of English and Chinese texts totaling trillions of tokens.
The supervised fine-tuning dataset consists of over 100,000 human-annotated prompts and responses.
Quotes
"To encourage reproducibility of fine-tuned LLMs and foster responsible development of LLMs, we release TeleChat, a collection of chat models that have been fine-tuned using human alignment techniques including supervised fine-tuning and reinforcement learning."
"By employing these techniques, TeleChat successfully extends its context window to over 96k tokens."
"Experimental results demonstrate that by employing these techniques, TeleChat successfully extends its context window to over 96k tokens."