Key Concepts
This work presents a comprehensive study of the issues and challenges in IPA transcription for Bengali, together with a novel IPA transcription framework, the DUAL-IPA dataset, and deep-learning (DL) benchmarking results.
Summary
This paper examines the existing research on the International Phonetic Alphabet (IPA) standard and core Bengali phonemes, identifies current and potential issues, and proposes a framework for a Bengali IPA standard.
The key highlights and insights are:
A detailed discussion of the ongoing scholarly debate over the IPA standard and the core Bengali phonemes, including vowels, semi-vowels, diphthongs, and consonants.
A comprehensive IPA transcription framework for Bengali that addresses morphological variation, diphthongs, loan words, and contextual substitution of phonemes.
The DUAL-IPA dataset: a novel 150k sentence-level parallel corpus with IPA transcriptions, created using the proposed framework and validated by linguists.
Benchmarking results from a simple LLM-based seq2seq model, which achieves a Word Error Rate (WER) of 0.1 on the test dataset.
The work has the potential to contribute to linguistic theory, support NLP dataset creation, and facilitate downstream LLM tasks for the Bengali language.
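For readers unfamiliar with the metric, Word Error Rate is commonly computed as the word-level edit distance between the reference and the predicted transcription, divided by the number of reference words. The sketch below is a minimal illustration of that standard definition, not the evaluation script used in the work; the function names, whitespace tokenization, and corpus-level aggregation are all assumptions, and the tokens in the usage note are placeholders rather than real IPA transcriptions.

```python
def edit_distance(ref_tokens, hyp_tokens):
    """Levenshtein distance between two token lists (substitution, insertion, deletion cost 1)."""
    # prev[j] holds the distance between the reference prefix seen so far and hyp_tokens[:j].
    prev = list(range(len(hyp_tokens) + 1))
    for i, ref_tok in enumerate(ref_tokens, start=1):
        curr = [i]
        for j, hyp_tok in enumerate(hyp_tokens, start=1):
            cost = 0 if ref_tok == hyp_tok else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def word_error_rate(references, hypotheses):
    """Corpus-level WER: total word edits divided by total reference words.

    Assumes whitespace tokenization, which may differ from the original evaluation.
    """
    total_errors = 0
    total_words = 0
    for ref, hyp in zip(references, hypotheses):
        ref_tokens = ref.split()
        hyp_tokens = hyp.split()
        total_errors += edit_distance(ref_tokens, hyp_tokens)
        total_words += len(ref_tokens)
    if total_words == 0:
        raise ValueError("references contain no words")
    return total_errors / total_words
```

Under this definition, a WER of 0.1 corresponds to roughly one word-level edit per ten reference words. For example, `word_error_rate(["a b c d e f g h i j"], ["a b c d e f g h i x"])` has one substitution in ten words.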
Statistics
The dataset contains 150k sentences.
The train split contains 100k sentences and the test split contains 50k sentences. There are about 130k unique words in the training data and 35k out-of-vocabulary words in the test dataset.
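As a rough illustration of how such split statistics can be derived, the sketch below counts unique words in a training split and the test words that never occur in training. It is a minimal sketch under stated assumptions: the helper names are hypothetical, words are split on whitespace, and out-of-vocabulary words are counted as unique types; the source does not say how the reported figures were tokenized or counted.

```python
def vocabulary(sentences):
    """Set of unique whitespace-separated words across the given sentences."""
    return {word for sentence in sentences for word in sentence.split()}


def oov_words(train_sentences, test_sentences):
    """Words that appear in the test split but never in the training split."""
    return vocabulary(test_sentences) - vocabulary(train_sentences)


# Toy placeholder data, not taken from the DUAL-IPA dataset.
train = ["a b c", "b d"]
test = ["a e", "f b e"]

train_vocab_size = len(vocabulary(train))  # unique training words
oov = oov_words(train, test)               # test-only words
```

With the toy data above, the training vocabulary has four unique words and the out-of-vocabulary set contains `e` and `f`.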
Quotes
"This work examines prior research, identifies current and potential issues, and suggests a framework for a Bengali IPA standard, facilitating linguistic analysis and NLP resource creation and downstream technology development."
"We open-source the dataset with the CC BY-SA 4.0 license."