Core Concept
IndicLLMSuite aims to bridge the gap in language model development by providing resources and tools for Indic languages.
Summary
IndicLLMSuite introduces a suite of resources for developing Indic LLMs, covering 22 languages with 251B tokens and 74.8M instruction-response pairs. The approach combines curated data, unverified data, and synthetic data. For pre-training, the authors build a pipeline that curates data from sources such as websites, PDFs, and videos. For fine-tuning, they amalgamate existing datasets, translate English datasets into Indian languages, and address toxicity alignment by generating toxic prompts for multiple scenarios. The released datasets aim to propel research in Indic LLMs and serve as a blueprint for other languages.
Statistics
Our work aims to bridge the divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs, covering 22 languages with a total of 251B tokens and 74.8M instruction-response pairs.
We build a clean, open-source pipeline for curating pre-training data from diverse sources including websites, PDFs, and videos.
Toxic prompts are generated for multiple scenarios to address toxicity alignment in Indic LLMs.
The datasets released aim to propel research and development of Indic LLMs while establishing an open-source blueprint for extending such efforts to other languages.
Quotes
"We hope that the datasets, tools, and resources released as a part of this work will not only propel the research and development of Indic LLMs but also establish an open-source blueprint for extending such efforts to other languages." - Content
"Our work aims to bridge this divide by introducing an expansive suite of resources specifically designed for the development of Indic LLMs." - Content
"Additionally, we address toxicity alignment by generating toxic prompts for multiple scenarios." - Content
"The data and other artifacts created as part of this work are released with permissive licenses at https://github.com/AI4Bharat/IndicLLMSuite" - Content