TextMonkey is a large multimodal model (LMM) designed for text-centric tasks such as document question answering and scene text analysis. It introduces Shifted Window Attention with zero-initialization, which improves cross-window connectivity and stabilizes training. By filtering out redundant tokens and incorporating positional information, TextMonkey enhances interpretability and reduces hallucinations. It notably improves results across a range of benchmark datasets and surpasses prior models in document understanding.
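The idea behind Shifted Window Attention with zero-initialization can be illustrated with a minimal sketch. This is not the TextMonkey implementation. It assumes a 1D token sequence, plain dot-product attention without learned projections, and a scalar `gate` standing in for the zero-initialized parameters. The function names (`window_attention`, `roll`, `shifted_window_block`) are hypothetical. The sketch also omits the masking that window-based models typically apply to tokens wrapped around by the cyclic shift.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]


def window_attention(tokens, window):
    """Self-attention restricted to non-overlapping windows of `window` tokens."""
    out = []
    for start in range(0, len(tokens), window):
        block = tokens[start:start + window]
        for q in block:
            scale = math.sqrt(len(q))
            scores = [sum(a * b for a, b in zip(q, k)) / scale for k in block]
            weights = softmax(scores)
            out.append([sum(w * v[d] for w, v in zip(weights, block))
                        for d in range(len(q))])
    return out


def roll(tokens, shift):
    """Cyclically shift the sequence left by `shift` positions."""
    shift %= len(tokens)
    return tokens[shift:] + tokens[:shift]


def shifted_window_block(tokens, window, gate=0.0):
    """Residual block: x + gate * ShiftedWindowAttention(x).

    Shifting by half a window lets tokens from neighbouring windows attend
    to each other. With gate == 0.0 (the assumed zero-initialization) the
    block is the identity, so adding it does not disturb the starting model.
    """
    half = window // 2
    shifted = roll(tokens, half)
    attended = roll(window_attention(shifted, window), -half)  # undo the shift
    return [[x + gate * a for x, a in zip(tok, att)]
            for tok, att in zip(tokens, attended)]


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [0.5, -0.5]]
    print(shifted_window_block(toks, window=2, gate=0.0) == toks)  # identity at init
    print(shifted_window_block(toks, window=2, gate=1.0))
```

With the gate at zero the block passes its input through unchanged, which is one way to read the "stabilize training" claim: the new cross-window path only starts to contribute as the gate is learned.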
TextMonkey can also be fine-tuned to understand commands for clicking on screenshots. The method improves performance by 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and Key Information Extraction (KIE), respectively, and achieves a score of 561 on OCRBench.
Key Insights Distilled From
by Yuliang Liu,... : arxiv.org 03-08-2024
https://arxiv.org/pdf/2403.04473.pdf