Core Concepts
A deep learning model using Artificial Neural Networks can effectively recognize the geographical division of Bangladeshi speakers from their continuous Bengali speech.
Abstract
The researchers developed a method to recognize the geographical division of Bangladeshi speakers from their continuous Bengali speech using Artificial Neural Networks. They collected over 45 hours of audio data from 633 speakers across 8 divisions of Bangladesh and performed preprocessing tasks such as noise reduction and audio segmentation.
The key highlights of the study are:
- They extracted Mel Frequency Cepstral Coefficients (MFCC) and delta features from the speech data as input to the neural network model.
- The proposed Artificial Neural Network model had 5 dense layers with ReLU activation, dropout regularization, and a softmax output layer for 8-way division classification.
- The model was trained with the Adam optimizer and categorical cross-entropy loss, reaching a peak validation accuracy of 85.44%.
- The researchers analyzed the model's performance using a confusion matrix, which showed strong classification across the 8 Bangladeshi divisions.
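The architecture above can be sketched as a plain NumPy forward pass: five ReLU dense layers followed by a softmax output over the 8 divisions, scored with categorical cross-entropy. The layer widths and the 40-dimensional MFCC+delta input are hypothetical placeholders (the summary does not give the exact sizes), and dropout is omitted because it is only active during training.

```python
import numpy as np

rng = np.random.default_rng(0)

def relu(x):
    return np.maximum(0.0, x)

def softmax(x):
    e = np.exp(x - x.max())  # subtract max for numerical stability
    return e / e.sum()

# Hypothetical layer widths: 40 MFCC+delta features in, 8 divisions out.
# Five ReLU dense layers, then a softmax output layer.
sizes = [40, 256, 128, 64, 32, 16, 8]
weights = [rng.standard_normal((m, n)) * 0.1
           for m, n in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(n) for n in sizes[1:]]

def forward(x):
    """Forward pass: ReLU on the five hidden layers, softmax on the output."""
    for W, b in zip(weights[:-1], biases[:-1]):
        x = relu(x @ W + b)
    return np.asarray(softmax(x @ weights[-1] + biases[-1]))

def cross_entropy(probs, label):
    """Categorical cross-entropy loss for a single true class index."""
    return -np.log(probs[label] + 1e-12)

probs = forward(rng.standard_normal(40))
```

In a real implementation these pieces would come from a framework such as Keras (Dense layers, Dropout, Adam, `categorical_crossentropy`); the sketch only shows the shape of the computation the summary describes.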
The authors conclude that their deep learning approach can effectively recognize the geographical division of Bangladeshi speakers from their continuous Bengali speech, a capability with applications in speaker identification, crime investigation, and fraud detection.
Stats
The dataset contained over 45 hours of audio data from 633 speakers (416 male, 217 female) across 8 divisions of Bangladesh.
Each audio sample was segmented into 8-10 second chunks before feature extraction.
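The chunking step can be illustrated with a minimal sketch. The 16 kHz sample rate and the fixed 8-second chunk length are assumptions for the example (the summary reports variable 8-10 second segments); any trailing audio shorter than one chunk is simply dropped.

```python
import numpy as np

def segment(signal, sample_rate=16000, chunk_seconds=8):
    """Split a 1-D audio signal into non-overlapping fixed-length chunks,
    discarding any trailing remainder shorter than one full chunk."""
    chunk_len = sample_rate * chunk_seconds
    n_chunks = len(signal) // chunk_len
    return [signal[i * chunk_len:(i + 1) * chunk_len] for i in range(n_chunks)]

# 30 seconds of synthetic silence -> three full 8-second chunks,
# with the final 6-second remainder dropped.
audio = np.zeros(16000 * 30)
chunks = segment(audio)
```

Each chunk would then go through feature extraction (MFCCs and their deltas) before being fed to the classifier.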
Quotes
"Speech is one of the easiest mediums of communication because it has a lot of identical features for different speakers."
"Accent-based speaker recognition is one of the emerging topics for ASR researchers."
"Deep learning algorithms like RNN, ANN, CNN, and LSTM are performing better for speech recognition because of their perfectly structured data."