The paper presents DELINE8K, a synthetic dataset and accompanying data pipeline designed to support the semantic segmentation of historical documents. The authors identify limitations in existing datasets, such as a lack of class variety and document diversity, which hinder the performance of models trained on them when applied to challenging data such as the National Archives Form Semantic Segmentation (NAFSS) dataset.
To address these limitations, the authors introduce DELINE8K, a dataset of 8,000 768x768 images with up to four layers: background, handwriting, printed text, and form elements. The background layer is synthesized using DALL·E, while the handwriting, printed text, and form elements are curated from various sources, including the IAM database, CSAFE Handwriting Database, EMNIST, and U.S. government agency forms.
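The layered construction can be illustrated with a minimal sketch. It is not the authors' pipeline: the class ids, the multiplicative ink blending, the rule that later layers overwrite earlier ones in the label map, and the random stand-in for the generated background are all assumptions made for illustration.

```python
import numpy as np

SIZE = 768  # image side length stated in the summary
# Class ids are an assumption made for this sketch.
BACKGROUND, HANDWRITING, PRINTED, FORM = 0, 1, 2, 3


def composite(background, layers):
    """Overlay ink layers on a background and build a per-pixel label map.

    background: float array of shape (SIZE, SIZE) with values in [0, 1].
    layers: list of (class_id, ink) pairs; ink is a float array of the
            same shape in [0, 1], where 0 means no ink.
    """
    image = background.copy()
    labels = np.full(background.shape, BACKGROUND, dtype=np.uint8)
    for class_id, ink in layers:
        image = image * (1.0 - ink)   # darken the image where ink is present
        labels[ink > 0.5] = class_id  # assumption: later layers overwrite earlier ones
    return image, labels


rng = np.random.default_rng(0)
# Random texture standing in for a generated background image.
background = rng.uniform(0.7, 1.0, size=(SIZE, SIZE))

form = np.zeros((SIZE, SIZE))
form[100, 50:700] = 1.0            # one ruled form line
printed = np.zeros((SIZE, SIZE))
printed[80:95, 60:300] = 1.0       # a block of printed text
handwriting = np.zeros((SIZE, SIZE))
handwriting[85:110, 320:600] = 1.0  # a handwritten entry crossing the form line

image, labels = composite(
    background,
    [(FORM, form), (PRINTED, printed), (HANDWRITING, handwriting)],
)
```

A single label per pixel is a simplification; where handwriting crosses a form line, a pipeline could instead keep one mask per class so that overlaps are preserved.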
The authors train models on DELINE8K, SignaTR6K, and DIBCO, and evaluate each of them on NAFSS, SignaTR6K, and DIBCO. The model trained on DELINE8K significantly outperforms the others on NAFSS, which suggests that the synthetic data pipeline captures the diverse and complex nature of historical documents.
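This summary does not state which metric underlies the comparison. As one common choice for segmentation, the sketch below computes per-class intersection-over-union between a predicted and a reference label map; the function name and the toy arrays are hypothetical.

```python
import numpy as np


def per_class_iou(pred, target, num_classes):
    """Return {class_id: IoU} for every class present in pred or target."""
    ious = {}
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union == 0:
            continue  # class absent from both maps, so IoU is undefined
        ious[c] = np.logical_and(p, t).sum() / union
    return ious


# Toy label maps: 0 = background, 1 = handwriting, 2 = printed text.
target = np.array([[0, 0, 1, 1],
                   [0, 2, 2, 1]])
pred = np.array([[0, 0, 1, 2],
                 [0, 2, 1, 1]])

scores = per_class_iou(pred, target, num_classes=4)
```

Skipping classes that appear in neither map avoids a division by zero and keeps absent classes from distorting an average over classes.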
The authors also discuss the limitations of their approach, such as the challenges in distinguishing between italic/cursive fonts and handwriting, as well as handling non-standard text alignments. They suggest potential areas for improvement, including joint training on multiple datasets and implementing a two-step procedure for binarization and classification.
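The suggested two-step procedure can be sketched as follows, assuming a fixed intensity threshold for binarization and a hypothetical `classify_pixels` callable standing in for a trained classifier. Neither detail comes from the paper.

```python
import numpy as np


def binarize(image, threshold=0.5):
    """Step 1: mark pixels darker than the threshold as ink."""
    return image < threshold


def two_step_segment(image, classify_pixels, threshold=0.5):
    """Step 2: assign a class only to pixels that step 1 marked as ink.

    classify_pixels is a hypothetical callable that maps an image to an
    array of class ids with the same shape.
    """
    ink = binarize(image, threshold)
    labels = np.zeros(image.shape, dtype=np.uint8)  # 0 = background
    labels[ink] = classify_pixels(image)[ink]
    return labels


def toy_classifier(image):
    """Stand-in classifier: left half is handwriting (1), right half printed (2)."""
    out = np.ones(image.shape, dtype=np.uint8)
    out[:, image.shape[1] // 2:] = 2
    return out


image = np.ones((4, 6))
image[1, 1] = 0.1  # dark pixel in the left half
image[2, 4] = 0.2  # dark pixel in the right half

labels = two_step_segment(image, toy_classifier)
```

Separating the steps means the classifier never has to decide whether a pixel is ink at all, only which kind of ink it is.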
Overall, the DELINE8K dataset and the synthetic data pipeline presented in this paper advance document semantic segmentation and give researchers and practitioners working with historical documents a practical resource.