Bibliographic Information:
Xu, J., Yu, C., Xu, J., Ding, Y., Torvik, V. I., Kang, J., Sung, M., & Song, M. (2023). PubMed knowledge graph 2.0: Connecting papers, patents, and clinical trials in biomedical science. Scientific Data, 10(1), 415. https://doi.org/10.1038/s41597-023-02450-6
Research Objective:
This paper introduces PKG 2.0, an updated version of the PubMed Knowledge Graph (PKG). It aims to address the limitations of existing knowledge graphs by integrating papers, patents, and clinical trials into a more comprehensive and interconnected view of biomedical research.
Methodology:
PKG 2.0 integrates data from several sources, including PubMed, ClinicalTrials.gov, USPTO, and NIH ExPORTER. The authors link these sources using the following techniques:
- Biomedical Entity Extraction and Relation Mapping: Using BERN2, a biomedical text mining tool, to extract and normalize biomedical entities from papers, patents, and clinical trials, and mapping their relationships using the iBKH dataset.
- Author Name Disambiguation: Combining data from Author-ity and Semantic Scholar, and training a deep neural network model to resolve conflicts and disambiguate author names across different data sources.
- Citation Integration: Integrating citation data from PubMed, NIH Open Citation Collection (NIH-OCC), and the OpenCitations Index of Crossref open DOI-to-DOI citations (COCI) to create a more complete citation network.
- Project Linking: Using data from NIH ExPORTER to link NIH-funded projects with their corresponding papers, patents, and clinical trials.
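To make the citation integration step concrete, the following is a minimal sketch of merging citation edges from several sources into one deduplicated network while recording which sources reported each edge. It is not the authors' pipeline: the function name `merge_citations`, the toy IDs, and the choice to drop self-citing edges are assumptions made for illustration, and it assumes all identifiers have already been normalized to a common scheme (in practice, DOI-based sources such as COCI would first need mapping to PMIDs).

```python
from collections import defaultdict


def merge_citations(sources):
    """Union citation edges from several sources, keeping provenance.

    sources: dict mapping a source name to an iterable of
             (citing_id, cited_id) pairs, already in one ID scheme.
    Returns a dict mapping each unique (citing, cited) edge to the
    sorted list of source names that reported it.
    """
    provenance = defaultdict(set)
    for name, edges in sources.items():
        for citing, cited in edges:
            citing, cited = str(citing), str(cited)  # normalize ID type
            if citing == cited:
                continue  # assumption: treat self-loops as noise
            provenance[(citing, cited)].add(name)
    return {edge: sorted(names) for edge, names in provenance.items()}


# Hypothetical toy data; the IDs are placeholders, not real PMIDs.
toy_sources = {
    "PubMed": [(101, 55), (102, 55)],
    "NIH-OCC": [(101, 55), (103, 101)],
    "COCI": [("102", "55"), (104, 104)],
}
merged = merge_citations(toy_sources)
for edge, names in sorted(merged.items()):
    print(edge, names)
```

Keeping the provenance set per edge, rather than only the union, makes it possible to check afterwards how much each source contributes to the combined network.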
Key Findings:
- PKG 2.0 encompasses over 36 million papers, 1.3 million patents, and 0.48 million clinical trials, establishing fine-grained connections between these document types through bioentities, citations, disambiguated authors, and projects.
- The integration of multiple data sources and the use of advanced techniques for entity extraction, disambiguation, and linking resulted in a comprehensive and interconnected knowledge graph that provides a holistic view of the biomedical research landscape.
- The authors report validating PKG 2.0 against multiple datasets, with significant improvements in key tasks such as author disambiguation and biomedical entity recognition.
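The idea of connecting document types through shared bioentities can be sketched as follows. This is an illustrative toy, not the PKG 2.0 schema: the function `link_by_shared_entities`, the `(type, id)` document keys, and all IDs and entity names are hypothetical.

```python
from collections import defaultdict
from itertools import combinations


def link_by_shared_entities(doc_entities):
    """Link documents of different types that mention the same entity.

    doc_entities: dict mapping (doc_type, doc_id) to a set of
                  normalized entity names.
    Returns a dict mapping a pair of documents to the set of entities
    they share, for cross-type pairs only.
    """
    # Inverted index: entity -> documents that mention it.
    index = defaultdict(set)
    for doc, entities in doc_entities.items():
        for entity in entities:
            index[entity].add(doc)

    links = defaultdict(set)
    for entity, docs in index.items():
        for a, b in combinations(sorted(docs), 2):
            if a[0] != b[0]:  # keep only paper-patent, paper-trial, etc.
                links[(a, b)].add(entity)
    return dict(links)


# Hypothetical toy documents; IDs are placeholders.
toy_docs = {
    ("paper", "P1"): {"BRCA1", "breast cancer"},
    ("paper", "P2"): {"BRCA1"},
    ("patent", "US1"): {"BRCA1"},
    ("trial", "NCT1"): {"breast cancer"},
}
links = link_by_shared_entities(toy_docs)
for pair, shared in sorted(links.items()):
    print(pair, sorted(shared))
```

The same pattern would apply to the other link types named above (citations, disambiguated authors, projects) by swapping the entity set for the relevant shared key.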
Main Conclusions:
PKG 2.0 offers a valuable resource for biomedical researchers, bibliometric scholars, and those engaged in literature mining by providing a comprehensive and interconnected view of the relationships between papers, patents, and clinical trials. This integration enables a deeper understanding of knowledge transfer, innovation pathways, and the overall scientific ecosystem in biomedicine.
Significance:
PKG 2.0 represents a significant advancement in knowledge graph construction for biomedical research. Its comprehensive integration of diverse data sources and sophisticated linking methodologies provide a powerful tool for exploring the complex relationships within the field, potentially leading to new discoveries and innovations.
Limitations and Future Research:
- The accuracy of author disambiguation and entity linking, while improved, can be further enhanced.
- The current version of PKG 2.0 primarily focuses on biomedical literature; expanding its scope to include other relevant data sources could provide a more holistic view of the research landscape.
- Future research could explore the development of user-friendly interfaces and tools to facilitate the exploration and analysis of the vast amount of data within PKG 2.0.