Core Concepts
Leveraging the capabilities of advanced closed-source multimodal language models, the authors construct Square-10M, a massive, high-quality dataset for text-centric visual instruction tuning. Tuned on this data, their open-source model TextSquare outperforms existing state-of-the-art open-source models and even matches or surpasses leading closed-source models on various benchmarks.
Abstract
The paper introduces a new approach, termed "Square", for creating a large-scale, high-quality instruction-tuning dataset (Square-10M) for text-centric visual question answering (VQA). The dataset is generated using closed-source multimodal large language models (MLLMs) through a four-step process: Self-Questioning, Answering, Reasoning, and Evaluation.
The authors first collect a diverse set of text-rich images from various public sources, including natural scenes, charts, tables, receipts, books, slides, PDFs, documents, products, and web images. They then apply the Square method to these images, leveraging the capabilities of advanced closed-source MLLMs to generate high-quality VQA pairs and reasoning context.
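The four Square stages can be sketched as a simple pipeline. Below is a minimal, illustrative sketch, not the authors' code: `query_mllm` is a hypothetical stand-in for a call to a closed-source MLLM API, and the prompts are placeholders rather than the paper's actual prompt templates.

```python
def query_mllm(image, prompt):
    # Hypothetical placeholder: a real implementation would call a
    # closed-source multimodal LLM API with the image and prompt.
    return f"<MLLM response to: {prompt!r}>"

def square(image):
    """Sketch of Square: Self-Questioning, Answering, Reasoning, Evaluation."""
    # 1. Self-Questioning: have the MLLM propose questions about the image's text.
    questions = query_mllm(image, "Propose questions about the text in this image.")
    vqa_items = []
    for q in questions.splitlines():
        # 2. Answering: answer each question from the image content.
        a = query_mllm(image, f"Answer this question about the image: {q}")
        # 3. Reasoning: elicit question-specific context behind the answer.
        r = query_mllm(image, f"Explain the reasoning for answering {q!r} with {a!r}.")
        # 4. Evaluation: have the model judge correctness/relevance; keep only passes.
        verdict = query_mllm(image, f"Is {a!r} a correct, relevant answer to {q!r}?")
        if not verdict.strip().lower().startswith("no"):
            vqa_items.append({"question": q, "answer": a, "reasoning": r})
    return vqa_items
```

The key design point the paper emphasizes is that all four stages are handled by the MLLM itself, so the Evaluation stage acts as a self-filter that discards low-quality or irrelevant VQA pairs before they enter the dataset.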
The experiments demonstrate several key findings:
The authors' model, TextSquare, substantially surpasses the previous open-source state of the art among text-centric MLLMs and sets a new standard on OCRBench (62.2%). It even outperforms top-tier closed-source models such as GPT-4V and Gemini on 6 of 10 text-centric benchmarks.
The reasoning data in Square-10M is shown to improve model performance and mitigate hallucinations in text-centric VQA scenarios, since it supplies rich, question-specific contextual information.
The authors reveal the relationships between data scale, convergence loss, and model performance for text-centric VQA instruction tuning, demonstrating the effectiveness and necessity of the massive Square-10M dataset.
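A common way to examine such a scale-versus-loss relationship is a power-law fit on log-log axes. The sketch below is purely illustrative: the data points are invented for demonstration and are not the paper's measurements.

```python
import numpy as np

# Illustrative (made-up) points: fraction of the instruction-tuning data used
# vs. the converged training loss. A power law L(N) ~ a * N**(-b) is a
# standard model for this kind of relationship.
data_fractions = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
conv_loss = np.array([0.62, 0.55, 0.51, 0.485, 0.47])

# Fit a line in log-log space: log L ~ slope * log N + intercept.
slope, intercept = np.polyfit(np.log(data_fractions), np.log(conv_loss), 1)

# A negative slope means convergence loss keeps falling as data scales up,
# i.e., the full 10M-scale dataset is still paying off.
print(f"fitted power-law exponent: {-slope:.3f}")
```

The point of the fit is diagnostic: if loss is still decreasing smoothly at the largest scale tried, the dataset size is not yet saturated, which is the argument the authors make for the necessity of Square-10M's scale.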
Stats
Example answers to image-grounded questions, illustrating the kind of VQA pairs the pipeline produces:
The total number of deaths in prisons and camps is 1,146,000.
The ratio of people who approve of Putin's handling of corruption to those who don't is 2.13.
The page number shown at the bottom of the image is XV.
Quotes
"Leveraging Square-10M, TextSquare achieves a significant outperformance of existing open-source models and even comparable or superior performance to SOTA closed-source models on various benchmarks, e.g., +0.9% on ChartQA, +2.1% on WTQ, +4.3% on SROIE."
"Reasoning data is demonstrated to be beneficial in improving model performance and mitigating hallucinations in text-centric VQA scenarios, as it can deliver rich question-specific contextual information."
"Through extensive experiments, we reveal the relationships between data scale, convergence loss, and model performance for text-centric VQA instruction tuning, which demonstrates the effectiveness and necessity of Square-10M."