INDICVOICES is a dataset of natural and spontaneous speech from 16,237 speakers, covering 145 districts and 22 languages in India. It includes read, extempore, and conversational audio, and aims to capture the cultural, linguistic, and demographic diversity of the country. The paper also shares an open-source blueprint for data collection at scale, comprising protocols, tools, questions, prompts, and quality control mechanisms. The dataset is intended to support the development of IndicASR, a speech recognition model covering all 22 languages listed in the Constitution of India. The effort involves personnel in a variety of roles who collect and transcribe the speech data while ensuring inclusivity and representation of diversity.
Source: arxiv.org