Core Concepts
A dataset that bridges scientific text and printed English is crucial for training OCR models to extract text accurately from academic documents.
Stats
"The synthetic portion of PEaCE contains 1M images of printed English text, 100k images of numerical artifacts, and 100k images of (pseudo-)chemical equations."
"The real-world test set comprises 319 carefully curated records."
Quotes
"Optical Character Recognition (OCR) is an established task with the objective of identifying the text present in an image."
"We propose a novel dataset that contains images of both scientific texts and printed English for training and testing OCR models on articles from the hard sciences."