Core Concepts
This study proposes a transformer-based model for optical character recognition in Bengali and Nepali, two low-resource languages with unique script characteristics. The model achieves low character and word error rates, which suggests it is suitable for practical applications such as document digitization and text extraction.
Abstract
The paper discusses the development and evaluation of a transformer-based optical character recognition (OCR) model for Bengali and Nepali, two low-resource languages in South Asia.
Key highlights:
Unique characteristics of Bengali and Nepali scripts, such as dependent letters, overlapping characters, word separation, and confusing shapes, make OCR a challenging task.
The proposed model uses a transformer-based approach, leveraging the encoder-decoder architecture of the TrOCR model.
The encoder, a Vision Transformer (ViT), extracts features from the input images, while the decoder, a multilingual RoBERTa model, generates the recognized text.
The model was trained on the BanglaWriting dataset for Bengali and a manually curated dataset for Nepali.
During training, the model reached a Word Error Rate (WER) of 0.10 for Bengali and 0.14 for Nepali, and a Character Error Rate (CER) of 0.04 for Bengali and 0.10 for Nepali.
On the test set, performance remained strong: an average CER of 0.07 and WER of 0.12 for Bengali, and an average CER of 0.11 and WER of 0.15 for Nepali.
The results demonstrate the model's potential for practical applications in text extraction and document digitization for these low-resource languages.
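The CER and WER figures in the highlights above are edit-distance metrics: the number of insertions, deletions, and substitutions needed to turn the prediction into the reference, divided by the reference length. The sketch below shows the conventional definition in plain Python. It is an illustration, not the paper's evaluation code: the function names are hypothetical, and it assumes characters are counted as Unicode code points and words are split on whitespace, neither of which the summary specifies.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (each edit costs 1)."""
    prev = list(range(len(hyp) + 1))  # distance from empty ref prefix
    for i, r in enumerate(ref, start=1):
        curr = [i]  # distance from ref[:i] to empty hyp prefix
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference, hypothesis):
    """Character error rate. Assumption: one character = one code point."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


def wer(reference, hypothesis):
    """Word error rate. Assumption: words are whitespace-separated tokens."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


# Illustrative strings only; these are not taken from either dataset.
reference = "আমি ভাত খাই"
prediction = "আমি ভাত খাও"
print(f"CER = {cer(reference, prediction):.3f}")
print(f"WER = {wer(reference, prediction):.3f}")
```

For Bengali and Devanagari text, counting code points treats a consonant and its dependent vowel sign as two characters. An evaluation that counts grapheme clusters instead would produce different CER values, so the counting unit matters when comparing reported numbers.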
Stats
The model was trained on the BanglaWriting dataset, which contains 21,234 words and 32,784 characters, and a manually curated dataset for Nepali with around 7,000 words and 10,000 characters.
Quotes
"The outcomes demonstrate that the suggested technique achieves high precision in deciphering typed text in Bengali and Nepali."
"This study could aid in the advancement of linguistic technology in South Asia and be useful in a number of industries, including automation, administration, and education."