TextMonkey is a large multimodal model designed for text-centric tasks like document question answering and scene text analysis. It introduces Shifted Window Attention with zero-initialization to improve cross-window connectivity and stabilize training. By filtering out redundant tokens and incorporating positional information, TextMonkey enhances interpretability and minimizes hallucinations. The model's performance across various benchmark datasets has notably improved, surpassing prior models in document understanding.
TextMonkey can be fine-tuned to comprehend commands for clicking screenshots. The method boosts performance by 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document Oriented VQA, and KIE, respectively, achieving a score of 561 on OCRBench.
toiselle kielelle
lähdeaineistosta
arxiv.org
Tärkeimmät oivallukset
by Yuliang Liu,... klo arxiv.org 03-08-2024
https://arxiv.org/pdf/2403.04473.pdfSyvällisempiä Kysymyksiä