Core Concepts
ChemScraper, a fast and accurate system, leverages the geometric information in born-digital PDF molecule images to extract and parse molecular structures, generating annotated data for training visual parsers.
Abstract
The paper presents ChemScraper, a system for extracting and parsing molecular diagrams from PDF documents. The key highlights are:
ChemScraper leverages the explicit locations and shapes of characters, lines, and polygons in born-digital PDF molecule images, without requiring image processing, Optical Character Recognition (OCR), or vectorization.
The parsing pipeline applies a series of graph transformations that capture both visual and chemical structure, and outputs editable ChemDraw (CDXML) files. Its steps are tokenizing the PDF primitives, constructing a Minimum Spanning Tree (MST) over them, and transforming the resulting graph to identify atoms, bonds, and other molecular structures.
ChemScraper generates annotated data for training visual parsers, with primitive-level annotations for all graphical primitives, atoms, and bonds. This addresses the challenge of obtaining comprehensive training data for visual parsers, which is crucial for parsing molecules directly from raw images.
The system is evaluated using SMILES strings, molecular fingerprints, and labeled directed graphs, demonstrating high accuracy. Comparing labeled graphs defined directly over PDF drawing primitives is itself a contribution: it allows graphical structures to be compared directly and structure recognition errors to be compiled automatically.
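Two of the ideas above, building an MST over tokenized primitives and comparing labeled graphs to collect errors, can be illustrated with a minimal Python sketch. It is not ChemScraper's implementation: the function names, the use of primitive center points with Euclidean distance, and the dictionary-based graph representation are all assumptions made for illustration, and the sketch assumes that predicted and reference graphs share the same primitive identifiers.

```python
import math


def minimum_spanning_tree(points):
    """Prim's algorithm over the complete graph of 2D points.

    `points` are hypothetical primitive centers (x, y). Edge weights are
    Euclidean distances (an assumption, not the paper's stated metric).
    Returns a sorted list of (i, j) index pairs with i < j.
    """
    n = len(points)
    if n < 2:
        return []
    in_tree = {0}
    edges = []
    while len(in_tree) < n:
        # Pick the cheapest edge leaving the current tree.
        _, i, j = min(
            (math.dist(points[i], points[j]), i, j)
            for i in in_tree
            for j in range(n)
            if j not in in_tree
        )
        in_tree.add(j)
        edges.append((min(i, j), max(i, j)))
    return sorted(edges)


def label_errors(predicted, expected):
    """Return the keys whose labels disagree between two mappings.

    Keys are primitive ids (for node labels) or id pairs (for edge labels).
    A key present in only one mapping counts as an error.
    """
    keys = set(predicted) | set(expected)
    return sorted(k for k in keys if predicted.get(k) != expected.get(k))


# Hypothetical primitive centers: three close together, one far away.
centers = [(0, 0), (1, 0), (1, 1), (5, 5)]
mst_edges = minimum_spanning_tree(centers)

# Hypothetical reference and predicted labels for a three-atom fragment.
true_nodes = {0: "C", 1: "C", 2: "O"}
true_edges = {(0, 1): "single", (1, 2): "double"}
pred_nodes = {0: "C", 1: "C", 2: "N"}
pred_edges = {(0, 1): "single", (1, 2): "single"}

node_errors = label_errors(pred_nodes, true_nodes)
edge_errors = label_errors(pred_edges, true_edges)
```

Because both graphs are indexed by the same primitives, each disagreement points to a specific node or edge. That is the property that makes automatic error compilation possible, in contrast to comparing SMILES strings, which only indicates whether two molecules match as a whole.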
Stats
The paper does not provide any specific numerical data or statistics. The focus is on the system design and evaluation methodology.