Core Concepts
TextHawk is a multimodal large language model designed to excel at complex document-oriented tasks while maintaining outstanding general vision-language capabilities. It introduces several novel components to enhance fine-grained visual perception and information compression for processing high-resolution and high-density document images.
Abstract
The paper presents TextHawk, a multimodal large language model (MLLM) that is specifically designed for document-oriented tasks. TextHawk addresses the unique challenges posed by document images, which typically have higher resolution and information density compared to natural images.
Key highlights:
- ReSampling and ReArrangement (ReSA) module: Reduces the redundancy in document text and lowers the computational cost of the MLLM by cutting the number of visual tokens.
- Scalable Positional Embeddings (SPEs): Encode the position of each local feature while remaining scalable across image sizes.
- Query Proposal Network (QPN): Initializes the queries dynamically among different sub-images to enhance the fine-grained perception.
- Multi-Level Cross-Attention (MLCA) mechanism: Captures the hierarchical structure and semantic relations of document images to further improve the fine-grained visual perceptual ability.
- Enriched multimodal instruction-tuning dataset: The authors create a new dataset, DocGemini, by leveraging the visual capabilities of Gemini-Pro to generate high-quality document-oriented data.
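The token compression described for ReSA can be sketched in two steps: a resampling step, where a small set of queries attends over all visual tokens, and a rearrangement step, where neighbouring outputs are concatenated along the feature axis. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names `resample` and `rearrange`, the sizes (64 tokens, 16 queries, groups of 4), the single-head attention without learned projections, and the random queries are all hypothetical. In TextHawk the queries are initialized by the QPN rather than drawn at random.

```python
import math
import random


def resample(tokens, queries):
    """Pool tokens with single-head cross-attention: one output per query.

    Assumption: no learned key/value projections, only scaled dot-product
    attention, so the output count equals the number of queries.
    """
    dim = len(queries[0])
    pooled = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, t)) / math.sqrt(dim) for t in tokens]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        pooled.append(
            [sum(w * t[i] for w, t in zip(weights, tokens)) / total for i in range(dim)]
        )
    return pooled


def rearrange(tokens, group):
    """Concatenate every `group` consecutive tokens into one wider token."""
    if len(tokens) % group != 0:
        raise ValueError("token count must be divisible by group size")
    return [
        [x for t in tokens[i:i + group] for x in t]
        for i in range(0, len(tokens), group)
    ]


random.seed(0)
dim = 8  # illustrative feature width, not a value from the paper
visual_tokens = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(64)]
queries = [[random.gauss(0, 1) for _ in range(dim)] for _ in range(16)]

resampled = resample(visual_tokens, queries)   # 64 tokens -> 16 tokens
compressed = rearrange(resampled, group=4)     # 16 tokens -> 4 wider tokens
print(len(visual_tokens), len(resampled), len(compressed), len(compressed[0]))
```

With these illustrative sizes the sequence passed on to the language model shrinks from 64 tokens to 4, while each remaining token grows from 8 to 32 features. Resampling discards redundancy, and rearrangement trades sequence length for feature width without dropping any values.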
Extensive experiments show that TextHawk outperforms state-of-the-art methods on both document-oriented and general MLLM benchmarks, demonstrating strong fine-grained document perception alongside general vision-language ability.
Stats
It takes approximately 1 day to train LLaVA on a single 8-A100 machine.
TextHawk achieves performance gains of 11.0%, 7.3%, 8.4%, 3.5%, and 5.3% over UReader on DocVQA, ChartQA, InfoVQA, TabFact, and WTQ, respectively.
Quotes
"TextHawk excels in both general and document-oriented benchmarks, securing the top spot in 6 out of 9 benchmarks."
"Remarkably, TextHawk even surpasses TextMonkey, which employs a larger visual encoder, on DocVQA and WTQ benchmarks."