INDICVOICES is a dataset of natural and spontaneous speech from 16,237 speakers covering 145 districts and 22 languages in India. It includes read, extempore, and conversational audio, and aims to capture the country's cultural, linguistic, and demographic diversity. The paper also shares an open-source blueprint for data collection at scale, comprising protocols, tools, questions, prompts, and quality control mechanisms. The dataset is intended to support the development of IndicASR, a model covering all 22 languages listed in the Constitution of India. The effort involves personnel in various roles who collect and transcribe the speech data while ensuring inclusivity and representation of diversity.
Key insights drawn from
by Tahir Javed,... at arxiv.org 03-05-2024
https://arxiv.org/pdf/2403.01926.pdf