The technical report presents the development of TeleChat, a collection of large language models (LLMs) with 3 billion, 7 billion, and 12 billion parameters. The models are first pretrained on a large and diverse corpus of English and Chinese texts, totaling trillions of tokens. The pretraining process involves careful data cleaning, including rule-based filtering, deduplication, high-quality data selection, and security processing.
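The report does not publish its exact filtering rules, so the sketch below only illustrates the general shape of such a cleaning pipeline: a rule-based quality filter followed by exact deduplication. All thresholds, function names, and the toy corpus are assumptions for illustration, not the report's settings.

```python
import hashlib
import re

def passes_rule_filters(text: str, min_chars: int = 200, max_symbol_ratio: float = 0.3) -> bool:
    """Illustrative rule-based filter: drop very short documents and documents
    dominated by non-word symbols. Thresholds are assumed, not from the report."""
    if len(text) < min_chars:
        return False
    symbols = len(re.findall(r"[^\w\s]", text))
    return symbols / max(len(text), 1) <= max_symbol_ratio

def dedup_exact(documents):
    """Exact deduplication via content hashing; a production pipeline would
    typically also apply fuzzy methods (e.g. MinHash), not shown here."""
    seen, kept = set(), []
    for doc in documents:
        digest = hashlib.md5(doc.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(doc)
    return kept

corpus = ["Example English document. " * 20, "Example English document. " * 20, "短文"]
cleaned = [d for d in dedup_exact(corpus) if passes_rule_filters(d)]
```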
The report then describes the supervised fine-tuning stage, where the models are further trained on a large dataset of human-annotated prompts and responses to align them with human preferences for conversational AI. The fine-tuning methodology includes data organization, the use of noisy embedding fine-tuning, and multi-stage long-context training to expand the models' context window to 96k tokens.
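Noisy embedding fine-tuning (NEFTune) perturbs token embeddings during supervised fine-tuning only. Below is a minimal PyTorch-style sketch of the standard formulation (uniform noise scaled by alpha over the square root of sequence length times hidden dimension); the alpha value and tensor shapes are illustrative, not the report's configuration.

```python
import torch

def neftune_noise(embeddings: torch.Tensor, alpha: float = 5.0) -> torch.Tensor:
    """Add uniform noise to token embeddings during SFT, following the standard
    NEFTune formulation: noise ~ U(-1, 1) scaled by alpha / sqrt(seq_len * hidden_dim).
    The alpha value here is an assumed example."""
    batch, seq_len, hidden_dim = embeddings.shape
    scale = alpha / (seq_len * hidden_dim) ** 0.5
    noise = torch.empty_like(embeddings).uniform_(-1.0, 1.0) * scale
    return embeddings + noise

# Applied only in training; at inference the embeddings are left unchanged.
dummy = torch.randn(2, 16, 4096)
noisy = neftune_noise(dummy)
```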
Additionally, the report discusses the reinforcement learning approach used to further align the models with human norms and safety requirements. The engineering details, including the hardware setup and the parallel computing techniques employed, are also provided.
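To make the parallel-training aspect concrete, here is a back-of-the-envelope sizing sketch for a 3D-parallel run (tensor, pipeline, and data parallelism). The parallelism degrees, precision, and the 12B example are assumptions chosen for illustration, not the report's actual cluster configuration.

```python
def gpus_required(total_params_b: float, tensor_parallel: int, pipeline_parallel: int,
                  data_parallel: int, bytes_per_param: int = 2):
    """Estimate total GPU count and per-GPU weight memory for a 3D-parallel run.
    Optimizer states and activations are ignored; figures are illustrative only."""
    world_size = tensor_parallel * pipeline_parallel * data_parallel
    # Each model replica is split across tensor_parallel * pipeline_parallel GPUs.
    params_per_gpu_b = total_params_b / (tensor_parallel * pipeline_parallel)
    weight_mem_gb = params_per_gpu_b * 1e9 * bytes_per_param / 1e9
    return world_size, weight_mem_gb

# Example: a 12B-parameter model with TP=4, PP=2, DP=8 and bf16 weights.
print(gpus_required(12, 4, 2, 8))  # -> (64, 3.0)
```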
The performance of TeleChat is evaluated on a wide range of benchmarks, including examination tests, language understanding, reasoning, and coding tasks. The results demonstrate that TeleChat outperforms other open-source models of similar size across various domains. To support future research and applications, the report releases the fine-tuned model checkpoints of TeleChat's 7B and 12B variants, along with a portion of the pretraining data.
Finally, the report presents a method for alleviating hallucination in TeleChat by incorporating knowledge graphs, which significantly improves the model's ability to provide accurate answers to factual questions.
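The report's specific knowledge-graph integration is not reproduced here; the sketch below only shows the general retrieve-then-ground pattern of pulling triples from a knowledge graph and prepending them to the prompt so the model answers from explicit evidence. The toy graph, entity names, and prompt template are hypothetical.

```python
# Hypothetical toy knowledge graph: (subject, relation) -> object.
KNOWLEDGE_GRAPH = {
    ("TeleChat", "parameter_sizes"): "3B, 7B, and 12B",
    ("TeleChat", "context_window"): "96k tokens",
}

def retrieve_facts(entity: str) -> list[str]:
    """Look up all triples whose subject matches the entity in the question."""
    return [f"{s} {r.replace('_', ' ')}: {o}"
            for (s, r), o in KNOWLEDGE_GRAPH.items() if s == entity]

def grounded_prompt(question: str, entity: str) -> str:
    """Prepend retrieved facts so the model answers from explicit evidence
    rather than from parametric memory alone."""
    facts = "\n".join(retrieve_facts(entity))
    return f"Known facts:\n{facts}\n\nQuestion: {question}\nAnswer using only the facts above."

print(grounded_prompt("What context window does TeleChat support?", "TeleChat"))
```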
Key insights distilled from the source paper by Zhongjiang H... et al., arxiv.org, 04-03-2024: https://arxiv.org/pdf/2401.03804.pdf