READOC is a novel benchmark that frames document structured extraction as a realistic, end-to-end task of converting unstructured PDFs into semantically rich Markdown text, enabling a unified evaluation of state-of-the-art approaches.
CREPE can output the parsing results of each document separately even when a single document image contains multiple documents, and it can simultaneously output the position coordinates of text strings within the document.
PDF-MVQA is a new dataset that enables the examination of semantically hierarchical layout structures in text-dominant documents, allowing the development of innovative models capable of navigating and interpreting real-world documents at a multi-page or entire document level.
LayoutLLM is an LLM/MLLM based method that integrates a document pre-trained model as encoder and employs a novel layout instruction tuning strategy to enhance the comprehension and utilization of document layouts for improved zero-shot document understanding.
OMNIPARSER is a unified framework that can simultaneously handle text spotting, key information extraction, and table recognition tasks through a single, concise model design.
Effective performance and knowledge transfer in a comprehensive document understanding model leveraging multiple teachers.
Improving OCR-free document understanding through unified structure learning.
The proposed GRAM approach extends existing single-page document models to efficiently process multi-page documents in visual question answering.