This paper proposes Detect-Order-Construct, a comprehensive tree-construction-based approach to hierarchical document structure analysis. The approach decomposes the task into three stages: detecting page objects and assigning them logical roles, predicting the reading order of the detected objects, and constructing the intended hierarchical structure, including the table of contents.
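The three-stage decomposition can be illustrated with a minimal sketch. This is not the paper's implementation: the names `PageObject`, `Node`, `order_objects`, `construct_tree`, and `table_of_contents` are hypothetical, the detection stage is assumed to have already produced objects with roles and heading levels, and the reading-order stage is replaced by a naive top-to-bottom, left-to-right sort purely for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class PageObject:
    """Hypothetical output of the detection stage."""
    text: str
    role: str        # assumed role set: "heading" or "paragraph"
    top: float       # vertical position on the page
    left: float      # horizontal position on the page
    level: int = 0   # heading depth (>= 1 for headings); an assumption


@dataclass
class Node:
    obj: Optional[PageObject]
    children: List["Node"] = field(default_factory=list)


def order_objects(objs: List[PageObject]) -> List[PageObject]:
    # Stand-in for the reading-order stage: a naive positional sort,
    # not a learned reading-order predictor.
    return sorted(objs, key=lambda o: (o.top, o.left))


def construct_tree(ordered: List[PageObject]) -> Node:
    # Stand-in for the construction stage: nest objects under the most
    # recent heading, using heading levels to decide parent/child links.
    root = Node(obj=None)
    stack = [(0, root)]  # (heading level, node); the root has level 0
    for o in ordered:
        node = Node(obj=o)
        if o.role == "heading":
            # Pop until the top of the stack is a strictly shallower heading.
            while stack[-1][0] >= o.level:
                stack.pop()
            stack[-1][1].children.append(node)
            stack.append((o.level, node))
        else:
            stack[-1][1].children.append(node)
    return root


def table_of_contents(node: Node, depth: int = 0) -> List[str]:
    # Collect headings only, indented by their depth in the tree.
    lines = []
    for child in node.children:
        if child.obj.role == "heading":
            lines.append("  " * depth + child.obj.text)
            lines.extend(table_of_contents(child, depth + 1))
    return lines


detected = [
    PageObject("2 Method", "heading", top=50, left=0, level=1),
    PageObject("1 Intro", "heading", top=10, left=0, level=1),
    PageObject("Some text", "paragraph", top=20, left=0),
    PageObject("1.1 Background", "heading", top=30, left=0, level=2),
    PageObject("More text", "paragraph", top=40, left=0),
]
tree = construct_tree(order_objects(detected))
toc = table_of_contents(tree)
```

Here `toc` holds the headings in reading order, indented by depth, while body paragraphs remain attached to their nearest enclosing heading in `tree`.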
RoDLA benchmarks the robustness of Document Layout Analysis models, introducing a taxonomy of perturbations and proposing metrics for evaluation.
Integrating Large Language Models (LLMs) with Visually-rich Document Understanding (VrDU) models improves performance on document analysis tasks.
The paper introduces a robustness benchmark for Document Layout Analysis models, proposes metrics to evaluate the impact of perturbations, and presents the RoDLA model for more robust feature extraction.
TextMonkey is a large multimodal model tailored for text-centric tasks, enhancing document understanding through innovative approaches.
CFRet-DVQA introduces a retrieval-augmented and efficient tuning framework for Document Visual Question Answering, achieving state-of-the-art results across various datasets.
TextMonkey introduces techniques such as Shifted Window Attention and a Token Resampler to improve document understanding with large multimodal models.
The authors explore the transformative impact of language models and transformers on form understanding, showcasing their effectiveness in handling noisy scanned documents.
The paper introduces CFRet-DVQA, a framework that combines retrieval with efficient tuning to improve Document Visual Question Answering.