TextMonkey is a large multimodal model designed for text-centric tasks such as document question answering and scene text analysis. It introduces Shifted Window Attention with zero-initialization to improve cross-window connectivity and stabilize training. By filtering out redundant tokens and incorporating positional information, TextMonkey enhances interpretability and reduces hallucinations. The model notably improves performance across various benchmark datasets, surpassing prior models in document understanding.
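As a rough illustration of the zero-initialization idea, the sketch below is a minimal PyTorch-style example, not the authors' implementation; the module and parameter names are assumptions, and the window shifting itself is omitted. It gates the output of an attention branch with a learnable scalar initialized to zero, so the cross-window connections are introduced gradually as training proceeds.

```python
import torch
import torch.nn as nn

class ZeroInitShiftedWindowAttention(nn.Module):
    """Hypothetical sketch: the attention branch is scaled by a learnable
    gate initialized to zero, so at the start of training the block acts as
    an identity mapping and cross-window mixing is learned gradually.
    Names and shapes are illustrative, not TextMonkey's actual code."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        # Zero-initialized gate: the branch contributes nothing at step 0.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, tokens, dim); a full implementation would shift and
        # partition tokens into windows before attending.
        attn_out, _ = self.attn(x, x, x)
        return x + self.gate * attn_out

if __name__ == "__main__":
    x = torch.randn(2, 16, 64)
    block = ZeroInitShiftedWindowAttention(dim=64)
    y = block(x)
    print(y.shape)  # torch.Size([2, 16, 64]); equals x at initialization
```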
TextMonkey can also be fine-tuned to understand commands for clicking on screenshots. The method boosts performance by 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and KIE, respectively, and achieves a score of 561 on OCRBench.