The Intellecta Cognitiva dataset is a comprehensive 11.53 billion token compendium designed to advance the capabilities of language models. It is composed of 30.5% textbook data from scholarly publications and 69.5% synthetic data across various domains, including programming, mathematics, natural language processing, and more.
The synthetic data is generated using the Mixtral-8x7B-Instruct-v0.1 model, which is capable of simulating complex thought processes and producing textbook-style explanations. This dual-composed content aims to mirror the learning process of a human, where they not only learn from textbooks but also engage in a thought process to understand and apply the knowledge.
The curation of the Intellecta dataset involves a rigorous process, including data acquisition, OCR conversion, synthetic data generation, cleaning, normalization, and ethical screening. Techniques like Simhash are employed to remove duplicates, and the Perspective API is used to identify and eliminate toxic content, ensuring the dataset adheres to high ethical standards.
The dataset covers a diverse range of topics, each selected for its substantial educational value. The topics span scientific, technical, and literary domains, providing language models with a comprehensive learning experience. Cluster analysis visualizes the semantic relationships between the topics, highlighting the dataset's depth and multifaceted nature.
Evaluation of the Intellecta model, a 634 million parameter model trained on the dataset, demonstrates its competitive performance across various benchmarks, including ARC, HellaSwag, MMLU, Winogrande, and GSM8K. The results suggest that the Intellecta dataset can produce high-quality language models capable of significant cross-domain generalization, even with relatively fewer parameters and tokens compared to larger models.
The Intellecta Cognitiva dataset represents a transformative stride in synthetic data generation, aiming to enhance the cognitive capabilities of language models and drive advancements in AI research. Its strategic design, rigorous curation, and commitment to quality and ethical standards position it as a critical resource for the AI community.
Till ett annat språk
från källinnehåll
arxiv.org
Viktiga insikter från
by Ajmal PS,Dit... på arxiv.org 04-23-2024
https://arxiv.org/pdf/2404.13065.pdfDjupare frågor