IndicLLMSuite: A Blueprint for Indian Language Datasets
The authors present a suite of resources for developing Indic Large Language Models (LLMs), aiming to close the gap in data availability for non-English languages, with a focus on 22 Indian languages. Their approach combines three kinds of data (curated, unverified, and synthetic) to build datasets for both pre-training and fine-tuning.