Core Concepts
DLoRA-TrOCR is a parameter-efficient hybrid text recognition method built on a pre-trained OCR Transformer. It embeds DoRA into the image encoder and LoRA into the text decoder, enabling efficient fine-tuning for mixed handwritten, printed, and street view text recognition.
Summary
This study addresses mixed text recognition: fine-tuning a pre-trained OCR Transformer so that it performs well across several downstream tasks. The authors propose DLoRA-TrOCR, a parameter-efficient hybrid text recognition method that embeds DoRA (Weight-Decomposed Low-Rank Adaptation) into the image encoder and LoRA (Low-Rank Adaptation) into the text decoder.
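To make the difference between the two adapters concrete, here is a minimal pure-Python sketch of how LoRA and DoRA each reparameterize one frozen weight matrix. It follows the standard published formulations (LoRA: W = W0 + BA; DoRA: W = m · V / ‖V‖c with V = W0 + BA), not the authors' code. The matrix sizes, the rank, and all function names are hypothetical and chosen only for illustration.

```python
import math

def matmul(a, b):
    """Plain list-of-lists matrix product a @ b."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def column_norms(w):
    """L2 norm of every column of w."""
    return [math.sqrt(sum(v * v for v in col)) for col in zip(*w)]

def lora_weight(w0, b, a, scale=1.0):
    """LoRA: W = W0 + scale * (B @ A). W0 stays frozen; only B and A are trained."""
    delta = matmul(b, a)
    return [[w + scale * d for w, d in zip(rw, rd)] for rw, rd in zip(w0, delta)]

def dora_weight(w0, b, a, m, scale=1.0):
    """DoRA: W = m * V / ||V||_c with V = W0 + scale * (B @ A).

    The direction V is updated through the low-rank pair (B, A), and the
    per-column magnitude vector m is trained separately.
    """
    v = lora_weight(w0, b, a, scale)
    norms = column_norms(v)
    return [[m[j] * row[j] / norms[j] for j in range(len(m))] for row in v]

def lora_trainable_params(d_out, d_in, rank):
    """B is d_out x rank and A is rank x d_in."""
    return rank * (d_out + d_in)

def dora_trainable_params(d_out, d_in, rank):
    """Same low-rank pair as LoRA plus one magnitude per column."""
    return lora_trainable_params(d_out, d_in, rank) + d_in

# Hypothetical 4x3 frozen weight with a rank-2 adapter.
d_out, d_in, rank = 4, 3, 2
w0 = [[1.0, 2.0, 0.5], [0.0, -1.0, 1.5], [2.0, 0.5, -0.5], [-1.0, 1.0, 1.0]]
b = [[0.0] * rank for _ in range(d_out)]    # B starts at zero, so the update starts at zero
a = [[0.1, -0.2, 0.3], [0.4, 0.0, -0.1]]
m = column_norms(w0)                        # DoRA initializes m to the column norms of W0

w_lora = lora_weight(w0, b, a)
w_dora = dora_weight(w0, b, a, m)
print("LoRA trainable:", lora_trainable_params(d_out, d_in, rank))
print("DoRA trainable:", dora_trainable_params(d_out, d_in, rank))
print("Full fine-tuning:", d_out * d_in)
```

At initialization both adapted weights equal the frozen weight, so fine-tuning starts from the pre-trained model's behavior. At realistic layer sizes the trainable count, rank × (d_out + d_in), is a small fraction of the d_out × d_in parameters that full fine-tuning would update.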
Key highlights:
- The authors constructed a mixed dataset of English handwritten, printed, and complex scene text to cover diverse image formats.
- DLoRA-TrOCR establishes a baseline model for mixed-text OCR and applies PEFT (parameter-efficient fine-tuning) methods in a new way, addressing the complexity of fine-tuning Transformer-based vision-language models.
- Comprehensive experiments on benchmark datasets for handwritten, printed, and scene text recognition validate the effectiveness and efficiency of the proposed method. DLoRA-TrOCR achieves state-of-the-art performance on complex scene datasets that require simultaneous recognition of mixed text types, while using significantly fewer trainable parameters than similar methods.
Statistics
The IAM handwritten text dataset contains 6,842 lines of text in the training set and 2,915 lines in the test set.
The SROIE printed text dataset contains 10,682 lines of text in the training set and 6,897 lines in the test set.
The street view text datasets (IIIT5K, SVT, IC13, IC15, SVTP, CUTE) contain a total of 7,573 lines of text in the training set and 11,435 lines in the test set.
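As a quick sanity check on the figures above, the short sketch below adds up the per-source line counts. It assumes the mixed dataset is the plain union of the three sources, which this summary does not state explicitly. The dictionary layout and variable names are illustrative only.

```python
# (train_lines, test_lines) per source, as listed in the statistics above.
splits = {
    "IAM (handwritten)": (6842, 2915),
    "SROIE (printed)": (10682, 6897),
    "Street view (IIIT5K, SVT, IC13, IC15, SVTP, CUTE)": (7573, 11435),
}

# Assumption: the mixed dataset is the simple union of the three sources.
total_train = sum(train for train, _ in splits.values())
total_test = sum(test for _, test in splits.values())

print(f"Total training lines: {total_train}")
print(f"Total test lines: {total_test}")
for name, (train, test) in splits.items():
    print(f"{name}: {train / total_train:.1%} of train, {test / total_test:.1%} of test")
```

Under that assumption, printed text contributes the most training lines, while street view text makes up the largest share of the test lines.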
Quotes
"Our research aims to extend the application of pre-training and fine-tuning to the OCR domain."
"Experimental results show that compared to similar parameter adjustment methods, our model DLoRA-TrOCR has the smallest number of parameters and performs better."