INDICVOICES is a dataset of natural and spontaneous speech from 16,237 speakers covering 145 districts and 22 languages in India. It includes read, extempore, and conversational audio and aims to capture the cultural, linguistic, and demographic diversity of the country. The paper also shares an open-source blueprint for data collection at scale, comprising protocols, tools, questions, prompts, and quality control mechanisms. The dataset is intended to support the development of IndicASR, a model covering all 22 languages listed in the Constitution of India. The effort involves personnel in a range of roles who collect and transcribe hours of speech data while ensuring inclusivity and representation of diversity.
Key insights extracted from
by Tahir Javed,... at arxiv.org, 03-05-2024
https://arxiv.org/pdf/2403.01926.pdf