Core Concept
DOCMASTER is a unified platform designed to enable users to annotate PDF documents, train layout-aware and text-only models for document question-answering, and deploy the trained models for inference, all while preserving data privacy.
Summary
DOCMASTER is a unified platform that addresses the challenges of working with PDF documents for document question-answering (QA) tasks. The platform consists of three main interfaces:
Annotation Interface:
- Allows users to upload PDF documents, input questions, and highlight relevant text spans as answers.
- Leverages PDF.js for frontend rendering and PyMuPDF in the backend to accurately capture layout information of the highlighted text.
- Provides a robust method for mapping user selections from the PDF.js context to the PyMuPDF context.
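The summary does not spell out how selections are mapped between the two contexts, so the following is only a minimal sketch of one plausible approach, not the authors' method. It assumes an unrotated page, a uniform zoom factor, and a top-left origin on both sides. The helper names (viewport_to_pdf_points, words_in_selection) and the sample data are hypothetical. The word tuples follow the layout that PyMuPDF's page.get_text("words") returns, but the sketch itself needs only the standard library.

```python
from typing import List, Tuple

Rect = Tuple[float, float, float, float]  # (x0, y0, x1, y1)
# Layout of one entry returned by PyMuPDF's page.get_text("words"):
# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]


def viewport_to_pdf_points(rect: Rect, scale: float) -> Rect:
    """Convert a selection rectangle from viewport pixels to PDF points.

    Assumption: the page is unrotated and rendered at a uniform `scale`,
    and both coordinate systems use a top-left origin.
    """
    if scale <= 0:
        raise ValueError("scale must be positive")
    x0, y0, x1, y1 = (v / scale for v in rect)
    # Normalise so that (x0, y0) is always the top-left corner.
    return (min(x0, x1), min(y0, y1), max(x0, x1), max(y0, y1))


def words_in_selection(words: List[Word], selection: Rect) -> List[Word]:
    """Keep the words whose bounding-box centre lies inside the selection."""
    sx0, sy0, sx1, sy1 = selection
    picked = []
    for w in words:
        cx = (w[0] + w[2]) / 2
        cy = (w[1] + w[3]) / 2
        if sx0 <= cx <= sx1 and sy0 <= cy <= sy1:
            picked.append(w)
    return picked


# Hypothetical words, in points, as a backend might report them.
words = [
    (72.0, 100.0, 110.0, 112.0, "Start", 0, 0, 0),
    (114.0, 100.0, 150.0, 112.0, "date:", 0, 0, 1),
    (154.0, 100.0, 220.0, 112.0, "2024-01-15", 0, 0, 2),
]
# A selection drawn in the browser at 1.5x zoom around the date.
selection = viewport_to_pdf_points((228.0, 147.0, 333.0, 171.0), scale=1.5)
selected_text = " ".join(w[4] for w in words_in_selection(words, selection))
```

Testing the word centre rather than requiring full containment tolerates selections that clip a word's edge slightly, which is common with mouse-drawn highlights.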
Training Interface:
- Enables users to select annotated documents from different sessions to train both layout-aware and text-only models using the Hugging Face transformers library.
- Stores annotations and trained model weights in a local SQL database, ensuring privacy.
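As an illustration of keeping annotations in a local SQL database and pulling several sessions together for training, here is a small sketch. The source only says "local SQL database"; the use of SQLite, the table layout, and every function and column name below are assumptions made for the example.

```python
import json
import sqlite3

# Hypothetical schema; the table and column names are illustrative only.
SCHEMA = """
CREATE TABLE IF NOT EXISTS annotations (
    id        INTEGER PRIMARY KEY,
    session   TEXT NOT NULL,
    document  TEXT NOT NULL,
    question  TEXT NOT NULL,
    answer    TEXT NOT NULL,
    boxes     TEXT NOT NULL  -- JSON list of [page, x0, y0, x1, y1]
);
"""


def open_store(path=":memory:"):
    """Open (or create) the local database; nothing leaves the machine."""
    conn = sqlite3.connect(path)
    conn.execute(SCHEMA)
    return conn


def add_annotation(conn, session, document, question, answer, boxes):
    """Insert one question/answer pair with its highlighted boxes."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO annotations "
            "(session, document, question, answer, boxes) "
            "VALUES (?, ?, ?, ?, ?)",
            (session, document, question, answer, json.dumps(boxes)),
        )


def load_sessions(conn, sessions):
    """Return (question, answer, boxes) for all the chosen sessions."""
    sessions = list(sessions)
    if not sessions:
        return []
    placeholders = ",".join("?" for _ in sessions)
    rows = conn.execute(
        "SELECT question, answer, boxes FROM annotations "
        f"WHERE session IN ({placeholders}) ORDER BY id",
        sessions,
    ).fetchall()
    return [(q, a, json.loads(b)) for q, a, b in rows]


conn = open_store()
add_annotation(conn, "session-a", "offer_letter.pdf",
               "What is the start date?", "2024-01-15",
               [[0, 154.0, 100.0, 220.0, 112.0]])
add_annotation(conn, "session-b", "offer_letter_2.pdf",
               "What is the start date?", "2024-03-01",
               [[0, 150.0, 96.0, 216.0, 108.0]])
examples = load_sessions(conn, ["session-a"])
```

Storing the boxes alongside the answer text means the same rows can feed either a text-only model (which ignores the boxes) or a layout-aware one.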
Inference Interface:
- Allows users to upload new PDF documents, select a trained model, and receive highlighted answers to their questions within the PDF.
- Utilizes the layout information captured during annotation to provide a user-friendly experience, highlighting the relevant text spans in the PDF.
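To show how a predicted answer span could be turned into highlight regions, the sketch below merges word boxes into one rectangle per text line. It assumes the model's output has already been mapped back to word indices and that words carry block and line numbers in the PyMuPDF "words" layout. The function name highlight_rects and the sample data are hypothetical, and this is not presented as the platform's actual implementation.

```python
from typing import Dict, List, Tuple

# (x0, y0, x1, y1, text, block_no, line_no, word_no)
Word = Tuple[float, float, float, float, str, int, int, int]
Rect = Tuple[float, float, float, float]


def highlight_rects(words: List[Word], start: int, end: int) -> List[Rect]:
    """Merge the boxes of words[start..end] (inclusive) into one
    rectangle per (block, line), in reading order."""
    by_line: Dict[Tuple[int, int], Rect] = {}
    for x0, y0, x1, y1, _text, block, line, _idx in words[start:end + 1]:
        key = (block, line)
        if key in by_line:
            a0, b0, a1, b1 = by_line[key]
            by_line[key] = (min(a0, x0), min(b0, y0),
                            max(a1, x1), max(b1, y1))
        else:
            by_line[key] = (x0, y0, x1, y1)
    # dicts preserve insertion order, so the rectangles follow the
    # order in which the lines were first seen.
    return list(by_line.values())


# Hypothetical page words; the answer "Acme Research Labs" wraps a line.
words = [
    (72.0, 100.0, 110.0, 112.0, "Employer:", 0, 0, 0),
    (114.0, 100.0, 160.0, 112.0, "Acme", 0, 0, 1),
    (164.0, 100.0, 230.0, 112.0, "Research", 0, 0, 2),
    (72.0, 116.0, 100.0, 128.0, "Labs", 0, 1, 0),
    (104.0, 116.0, 140.0, 128.0, "Inc.", 0, 1, 1),
]
rects = highlight_rects(words, start=1, end=3)
```

Grouping by line avoids a single oversized rectangle that would cover unrelated text whenever an answer wraps onto the next line.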
The authors deployed DOCMASTER at the University of California San Diego's International Services and Engagement Office (ISEO) to streamline the processing of supporting documents for student work permit applications. Compared to the manual review process, DOCMASTER led to a seven-fold increase in the average number of documents processed per hour, while also preserving the privacy of sensitive information.
Statistics
The authors report the following performance metrics on a test set of 128 applications (1024 questions):
Exact Match Accuracy (Acc): 76.23% for RoBERTa-base, 75.98% for LayoutLM-base
F1-score (F1): 83.77% for RoBERTa-base, 83.07% for LayoutLM-base
Correctness (Corr): 93.56% for RoBERTa-base, 93.36% for LayoutLM-base
Average Bounding Box Distance (Dist): 1.13% for RoBERTa-base, 1.86% for LayoutLM-base
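The summary does not define how the exact-match and F1 figures are computed. The sketch below shows the SQuAD-style definitions that are commonly used for extractive QA, given here as an assumption rather than as the authors' evaluation code. Correctness and bounding-box distance are left out because their definitions are not given.

```python
import re
import string
from collections import Counter


def normalize(text):
    """Lowercase, strip punctuation and English articles, squeeze spaces."""
    text = text.lower()
    text = "".join(ch for ch in text if ch not in string.punctuation)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return " ".join(text.split())


def exact_match(prediction, reference):
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(prediction) == normalize(reference))


def token_f1(prediction, reference):
    """Harmonic mean of token-level precision and recall."""
    pred = normalize(prediction).split()
    ref = normalize(reference).split()
    if not pred or not ref:
        # Both empty counts as a match; one empty counts as a miss.
        return float(pred == ref)
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)


em = exact_match("The Acme Labs.", "acme labs")
f1 = token_f1("Acme Research Labs", "Acme Labs")
```

Under these definitions, a prediction that includes one extra word still earns partial F1 credit while scoring zero on exact match, which is consistent with the F1 figures above being higher than the accuracy figures.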
The authors also measured the throughput of the RoBERTa-base model deployed on an AMD EPYC 7453 28-Core Processor. With the model in place, the number of supporting documents that can be reviewed per hour rose from 15 to 100, a roughly seven-fold increase.