Core Concepts
OpenChemIE is a comprehensive system that extracts detailed chemical reactions, including reactants, products, and reaction conditions, from chemistry literature by integrating information across text, tables, and figures.
Abstract
OpenChemIE is a toolkit designed to extract comprehensive reaction data from chemistry literature. It addresses the challenge of integrating information across multiple modalities, including text, tables, and figures, to obtain complete reaction descriptions.
The key components of OpenChemIE include:
Figure Analysis:
  Molecule Detection (MolDetect): Identifies molecular sub-images within figures.
  Text-Figure Coreference (MolCoref): Aligns molecule identifiers in text with their corresponding structures in figures.
  Reaction Diagram Parsing (RxnScribe): Extracts reaction schemes, including reactants, products, and conditions, from reaction diagrams.
  Molecule Recognition (MolScribe): Translates molecular images into their SMILES representations.
Text Analysis:
  Named Entity Recognition (ChemNER): Extracts chemical entities from text.
  Reaction Extraction (ChemRxnExtractor): Identifies reactions described in text.
Multimodal Integration:
  Reaction Condition Alignment: Combines reaction conditions from tables with the corresponding reactions in figures.
  R-Group Resolution: Infers complete molecular structures by substituting R-groups defined in tables or figures into reaction templates.
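To make the R-group resolution step concrete, the following is a minimal sketch of the idea, not OpenChemIE's actual implementation or API. It assumes a hypothetical placeholder syntax in which R-groups appear in a SMILES-like template as [R1], [R2], and so on, and that each table row maps those labels to a substituent fragment whose first atom is the attachment point. The function names are illustrative only. Plain string substitution is a simplification; a real system would more likely operate on molecular graphs.

```python
import re

# Hypothetical placeholder syntax (an assumption for this sketch):
# R-groups appear in the template as [R], [R1], [R2], ...
R_GROUP_PATTERN = re.compile(r"\[(R\d*)\]")


def resolve_r_groups(template, substituents):
    """Replace every R-group placeholder in `template` with its fragment.

    `substituents` maps labels such as "R1" to SMILES fragments whose
    first atom is the attachment point.
    """
    def substitute(match):
        label = match.group(1)
        if label not in substituents:
            raise KeyError(f"no definition for R-group {label!r}")
        return substituents[label]

    return R_GROUP_PATTERN.sub(substitute, template)


def expand_template(template, table_rows):
    """Produce one resolved structure per table row."""
    return [resolve_r_groups(template, row) for row in table_rows]


# Illustrative data: a phenyl ring with one variable position, and three
# rows of a made-up substituent table (methyl, methoxy, fluoro).
template = "c1ccccc1[R1]"
rows = [{"R1": "C"}, {"R1": "OC"}, {"R1": "F"}]

products = expand_template(template, rows)
print(products)  # ['c1ccccc1C', 'c1ccccc1OC', 'c1ccccc1F']
```

Each row of the table yields one fully specified structure, which is how a single templated scheme can expand into many individual reactions.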
OpenChemIE was evaluated on a newly annotated dataset of 1007 reactions from 78 substrate scope figures across 5 chemistry journals. It achieved an F1 score of 69.5% on this challenging task, demonstrating its ability to extract detailed reaction data by integrating information from multiple modalities. Additionally, in an end-to-end evaluation against the Reaxys database, OpenChemIE attained an accuracy of 64.3%.
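For readers unfamiliar with the metric, here is a small sketch of how an F1 score over extracted reactions can be computed. It assumes that a predicted reaction counts as correct only if its reactant and product sets exactly match an annotated reaction; the actual matching criteria used in the evaluation may differ. The data below is invented purely for illustration and is unrelated to the reported results.

```python
def reaction(reactants, products):
    """Represent a reaction as a hashable pair of SMILES sets."""
    return (frozenset(reactants), frozenset(products))


def precision_recall_f1(predicted, gold):
    """Compute precision, recall, and F1 between two sets of reactions."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Toy annotations (made up): four gold reactions, three predictions,
# two of which match a gold reaction exactly.
gold = {
    reaction(["CBr", "c1ccccc1O"], ["COc1ccccc1"]),
    reaction(["CCBr", "c1ccccc1O"], ["CCOc1ccccc1"]),
    reaction(["CBr", "c1ccccc1N"], ["CNc1ccccc1"]),
    reaction(["CCBr", "c1ccccc1N"], ["CCNc1ccccc1"]),
}
predicted = {
    reaction(["CBr", "c1ccccc1O"], ["COc1ccccc1"]),
    reaction(["CCBr", "c1ccccc1O"], ["CCOc1ccccc1"]),
    reaction(["CBr", "c1ccccc1N"], ["CCNc1ccccc1"]),  # wrong product
}

p, r, f1 = precision_recall_f1(predicted, gold)
print(f"precision={p:.3f} recall={r:.3f} f1={f1:.3f}")
```

Because F1 is the harmonic mean of precision and recall, a system is penalized both for reactions it misses and for reactions it extracts incorrectly.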
The toolkit is available as an open-source package and a web interface, enabling broader usage and future development in this area.
Stats
The reaction dataset used to evaluate OpenChemIE contains 1007 reactions from 78 substrate scope figures across 5 chemistry journals.