TextMonkey is a large multimodal model designed for text-centric tasks like document question answering and scene text analysis. It introduces Shifted Window Attention with zero-initialization to improve cross-window connectivity and stabilize training. By filtering out redundant tokens and incorporating positional information, TextMonkey enhances interpretability and minimizes hallucinations. The model's performance across various benchmark datasets has notably improved, surpassing prior models in document understanding.
TextMonkey can be fine-tuned to comprehend commands for clicking screenshots. The method boosts performance by 5.2%, 6.9%, and 2.8% in Scene Text-Centric VQA, Document Oriented VQA, and KIE, respectively, achieving a score of 561 on OCRBench.
Para outro idioma
do conteúdo fonte
arxiv.org
Principais Insights Extraídos De
by Yuliang Liu,... às arxiv.org 03-08-2024
https://arxiv.org/pdf/2403.04473.pdfPerguntas Mais Profundas