Key Concepts
PDF-MVQA is a new dataset for examining semantically hierarchical layout structures in text-dominant documents. It supports the development of models that can navigate and interpret real-world documents at the multi-page or whole-document level.
Summary
The PDF-MVQA dataset is introduced to address the limitations of existing document understanding datasets and models. It focuses on research journal articles, which are text-dominant yet visually rich, with components such as titles, paragraphs, tables, and charts organized across multiple pages.
The key highlights of the dataset and the proposed frameworks are:
- PDF-MVQA dataset:
- Collected from PubMed Central, containing 3,146 research articles with 30,239 pages in total.
- Includes annotated document entities (paragraphs, tables, figures) and associated questions.
- Designed to evaluate the understanding of layout and logical structure, especially in multi-page documents.
- Proposed Frameworks:
- RoI-based and Patch-based frameworks that leverage pretrained Vision-Language models to obtain enhanced entity representations.
- Joint-grained framework that combines coarse-grained entity representations with fine-grained token-level information to improve robustness (a minimal sketch of this fusion follows the list).
- Extensive experiments and analyses demonstrate the effectiveness of the proposed frameworks in retrieving target document entities, especially in complex sections and across multiple pages.
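To make the joint-grained idea concrete, here is a minimal, hypothetical sketch in PyTorch. The dimensions, projection layers, and cross-attention fusion are illustrative assumptions rather than the paper's exact architecture: `entity_feats` stands in for coarse-grained (RoI-pooled or patch-aggregated) entity features, and `token_feats` for fine-grained token embeddings from a pretrained language model.

```python
import torch
import torch.nn as nn

class JointGrainedFusion(nn.Module):
    """Fuse coarse-grained entity embeddings with fine-grained token embeddings.

    Hypothetical sketch: the layer choices and dimensions are assumptions,
    not the PDF-MVQA paper's exact architecture.
    """

    def __init__(self, entity_dim=768, token_dim=768, hidden_dim=768):
        super().__init__()
        self.entity_proj = nn.Linear(entity_dim, hidden_dim)
        self.token_proj = nn.Linear(token_dim, hidden_dim)
        # Each entity attends over the document's tokens, injecting
        # fine-grained textual evidence into its coarse representation.
        self.cross_attn = nn.MultiheadAttention(hidden_dim, num_heads=8,
                                                batch_first=True)
        self.norm = nn.LayerNorm(hidden_dim)

    def forward(self, entity_feats, token_feats):
        # entity_feats: (batch, num_entities, entity_dim) — e.g. RoI-pooled
        #   features for paragraphs, tables, and figures.
        # token_feats:  (batch, num_tokens, token_dim) — token embeddings
        #   from a pretrained language model.
        q = self.entity_proj(entity_feats)
        kv = self.token_proj(token_feats)
        fused, _ = self.cross_attn(q, kv, kv)
        return self.norm(q + fused)  # residual keeps the coarse signal

# Toy usage: one document, 5 entities, 128 tokens.
model = JointGrainedFusion()
out = model(torch.randn(1, 5, 768), torch.randn(1, 128, 768))
print(out.shape)  # torch.Size([1, 5, 768])
```

The residual connection is one plausible way to let the fine-grained tokens refine, rather than replace, the entity-level signal; the actual fusion mechanism in the paper may differ.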
The research aims to advance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA tasks.
Statistics
The PDF-MVQA dataset contains 3,146 documents with a total of 30,239 pages.
The dataset includes 262,928 question-answer pairs.
The number of questions per Super-Section: Introduction (31,719), Materials and Methods (58,437), Results and Discussion (116,521), Conclusion (7,715), Others (48,536).
The number of questions per entity type: Paragraph (231,172), Table (7,188), Figure (10,756).
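As a quick consistency check, the per-Super-Section counts above sum exactly to the reported 262,928 question-answer pairs:

```python
# Reported questions per Super-Section; their sum matches the dataset total.
per_super_section = {
    "Introduction": 31_719,
    "Materials and Methods": 58_437,
    "Results and Discussion": 116_521,
    "Conclusion": 7_715,
    "Others": 48_536,
}
assert sum(per_super_section.values()) == 262_928
```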
Quotes
"PDF-MVQA addresses the limitations of generative models in answering knowledge-intensive questions. It expands upon the benefits of retrieval-based models by incorporating multimodal document entities like paragraphs, tables and figures and exploring the cross-page layout and logical correlation between them."
"The research aims to advance the capabilities of existing vision-and-language models in handling challenges posed by text-dominant documents in VRD-QA tasks."