TextMonkey is a large multimodal model designed for text-centric tasks such as document question answering and scene text analysis. It introduces Shifted Window Attention with zero-initialization to improve cross-window connectivity and stabilize training. By filtering out redundant visual tokens and incorporating positional information into its answers, TextMonkey improves interpretability and reduces hallucinations. These changes notably improve performance across benchmark datasets, surpassing prior models in document understanding.
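The paper does not ship this exact code, but the zero-initialization idea can be illustrated with a minimal PyTorch sketch: the output of a shifted-window attention block is scaled by a learnable gate that starts at zero, so the block behaves as an identity map at the start of training and the pretrained backbone is not disturbed. The class name, window size, and use of `nn.MultiheadAttention` below are illustrative assumptions, not TextMonkey's actual implementation.

```python
import torch
import torch.nn as nn


class ZeroInitShiftedWindowBlock(nn.Module):
    """Hypothetical sketch: shifted-window attention whose output is gated by a
    scalar initialized to zero, so the block is a no-op at step 0 and the
    pretrained features pass through unchanged early in training."""

    def __init__(self, dim: int, window: int, heads: int):
        super().__init__()
        self.window = window
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        # Zero-initialized gate: training starts from the base model's behavior.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, H, W, C) grid of visual tokens; H and W divisible by window.
        B, H, W, C = x.shape
        s = self.window // 2
        # Shift the grid so windows straddle the original window boundaries.
        shifted = torch.roll(x, shifts=(-s, -s), dims=(1, 2))
        # Partition into non-overlapping windows and attend within each window.
        wins = (shifted
                .view(B, H // self.window, self.window, W // self.window, self.window, C)
                .permute(0, 1, 3, 2, 4, 5)
                .reshape(-1, self.window * self.window, C))
        attn_out, _ = self.attn(wins, wins, wins)
        # Reverse the window partition and the cyclic shift.
        out = (attn_out
               .view(B, H // self.window, W // self.window, self.window, self.window, C)
               .permute(0, 1, 3, 2, 4, 5)
               .reshape(B, H, W, C))
        out = torch.roll(out, shifts=(s, s), dims=(1, 2))
        # Residual connection with the zero-initialized gate.
        return x + self.gate * out
```

In this sketch, cross-window information only flows in gradually as the gate moves away from zero, which is one plausible way to read the paper's claim that zero-initialization stabilizes training.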
TextMonkey can also be fine-tuned to understand commands for clicking screenshots. The method boosts performance by 5.2%, 6.9%, and 2.8% on Scene Text-Centric VQA, Document-Oriented VQA, and KIE benchmarks, respectively, and achieves a score of 561 on OCRBench.