Key Concepts
TextHawk2 is a novel bilingual vision-language model that excels at OCR, grounding, and general multimodal understanding tasks while using significantly fewer image tokens than previous models.
Statistics
TextHawk2 compresses visual tokens by a factor of 16.
The model is pre-trained on a dataset of 100 million samples.
TextHawk2 achieves 78.4% accuracy on OCRBench.
It achieves 81.4% accuracy on ChartQA.
It achieves 89.6% ANLS on DocVQA.
It achieves 88.1% accuracy@0.5 on RefCOCOg-test.
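For readers unfamiliar with the grounding metric, accuracy@0.5 is commonly understood as the share of predicted boxes whose intersection over union (IoU) with the ground-truth box reaches 0.5. The sketch below illustrates that general definition only; the function names, the (x1, y1, x2, y2) box format, and the inclusive threshold are assumptions for illustration, not the evaluation code behind the reported RefCOCOg number.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    # Width and height of the overlap region, clamped at zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def accuracy_at_iou(predictions, targets, threshold=0.5):
    """Fraction of predictions whose IoU with the target meets the threshold.

    Assumption: the comparison is inclusive (>=); some benchmarks use a
    strict inequality instead.
    """
    hits = sum(iou(p, t) >= threshold for p, t in zip(predictions, targets))
    return hits / len(targets)


if __name__ == "__main__":
    # Hypothetical boxes: one exact match and one weak overlap.
    preds = [(0, 0, 2, 2), (0, 0, 2, 2)]
    gts = [(0, 0, 2, 2), (1, 0, 3, 2)]
    print(accuracy_at_iou(preds, gts))
```

In this toy input the first prediction matches exactly and the second overlaps its target by only one third, so one of the two predictions counts as a hit.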
Quotes
"We present TextHawk2, a bilingual LVLM featuring efficient fine-grained perception and demonstrating cutting-edge performance across general-purpose, OCR, and grounding tasks with 16 times fewer image tokens."
"Critical improvements include: (1) Token Compression: Building on the efficient architecture of its predecessor, TextHawk2 significantly reduces the number of tokens per image by 16 times, facilitating training and deployment of the TextHawk series with minimal resources."
"We demonstrate that our thoughtfully designed resampler can compress visual tokens by a factor of 16 without compromising fine-grained perception capabilities."
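To make the 16x compression factor concrete, the following sketch shows the arithmetic only. The function name, the example token count of 1024, and the choice to round up are hypothetical; the summary does not state how many visual tokens TextHawk2 produces per image or how its resampler handles remainders.

```python
def compressed_token_count(num_visual_tokens: int, ratio: int = 16) -> int:
    """Number of tokens left after compressing by `ratio`.

    Assumption: a partial group still costs one token, so we round up
    (ceiling division).
    """
    if num_visual_tokens < 0 or ratio <= 0:
        raise ValueError("token count must be >= 0 and ratio must be > 0")
    return -(-num_visual_tokens // ratio)


if __name__ == "__main__":
    # Hypothetical input: 1024 visual tokens before compression.
    before = 1024
    after = compressed_token_count(before)
    print(f"{before} visual tokens -> {after} tokens after 16x compression")
```

Under these assumptions, a hypothetical image encoded as 1024 visual tokens would reach the language model as 64 tokens.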