Core Concepts
Advancements in large language models have enabled complex instruction understanding in English, but a gap remains in Chinese instruction tuning.
Abstract
The COIG-CQIA dataset aims to bridge the gap in Chinese instruction tuning by providing a high-quality dataset sourced from various Chinese internet platforms.
The dataset is rigorously filtered and processed so that its instructions align with authentic human interaction patterns and improve model behavior.
Models trained on the COIG-CQIA dataset show competitive results in human assessment and knowledge benchmarks.
The dataset includes data from social media, forums, encyclopedias, NLP tasks, and examinations, ensuring diversity and relevance.
Various tasks and evaluations are conducted to analyze the dataset's impact on model performance, safety, and scalability.
Stats
"Data are available at https://huggingface.co/datasets/m-a-p/COIG-CQIA"
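Datasets like this typically follow an Alpaca-style instruction/input/output record schema. A minimal sketch of turning one such record into a training prompt; the field names ("instruction", "input", "output") and the sample record are assumptions for illustration, not confirmed from the source, so check the dataset card before use:

```python
# Sketch: format an instruction-tuning record into a single prompt string.
# Field names mirror the common Alpaca-style schema (an assumption here).

def format_record(record: dict) -> str:
    """Join instruction, optional input, and output into one prompt string."""
    parts = [f"Instruction: {record['instruction']}"]
    if record.get("input"):  # the input field is often empty and then omitted
        parts.append(f"Input: {record['input']}")
    parts.append(f"Response: {record['output']}")
    return "\n".join(parts)

# Hypothetical example record (not actual dataset content).
sample = {
    "instruction": "解释什么是指令微调。",
    "input": "",
    "output": "指令微调是在(指令, 回复)数据对上继续训练语言模型的方法。",
}
print(format_record(sample))
```

In practice the records would come from the Hugging Face `datasets` library (e.g. `load_dataset("m-a-p/COIG-CQIA", ...)`, with the required subset name taken from the dataset card) rather than a hand-written dict.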
Quotes
"The availability of high-quality instruction tuning datasets is crucial for LLMs to operate as efficient and dependable assistants."
"Models trained on CQIA-Subset achieve competitive results in human assessment as well as knowledge and security benchmarks."