The paper presents an automated approach, SimpleGermKG, for constructing a knowledge graph that connects germline genes and diseases extracted from biomedical literature. The workflow consists of the following steps:
Data Pre-processing: The abstracts are segmented into sentences using the PunktSentenceTokenizer from NLTK and prepared for the downstream models.
Named Entity Recognition: BioBERT, a BERT model pre-trained on biomedical corpora, is used to extract gene and disease mentions from the germline abstracts.
Named Entity Normalization: An ontology-based and rule-based algorithm is developed to standardize and disambiguate the extracted gene and disease entities by mapping them to master terms from curated dictionaries.
Semantic Relation: A part-whole relation approach is implemented to connect the normalized gene and disease entities with their corresponding PubMed IDs, forming a knowledge graph.
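The normalization and part-whole steps can be illustrated with a minimal Python sketch. It is not the authors' implementation: the dictionaries `GENE_MASTER` and `DISEASE_MASTER`, the relation names `HAS_GENE` and `HAS_DISEASE`, the helper names, and the placeholder ID `PMID:0000001` are all assumptions made for illustration. The sketch assumes that normalization reduces to a lookup of a cleaned mention in a curated dictionary, and that each article (the whole) is linked to every normalized entity found in it (the parts).

```python
from typing import Dict, Iterable, List, NamedTuple, Optional


class Triple(NamedTuple):
    subject: str   # the article, identified by its PubMed ID
    relation: str  # hypothetical relation name
    obj: str       # normalized master term


# Hypothetical curated dictionaries: surface form -> master term.
GENE_MASTER: Dict[str, str] = {
    "brca1": "BRCA1",
    "brca 1": "BRCA1",
    "tp53": "TP53",
    "p53": "TP53",
}
DISEASE_MASTER: Dict[str, str] = {
    "breast cancer": "Breast Neoplasms",
    "breast carcinoma": "Breast Neoplasms",
}


def normalize(mention: str, dictionary: Dict[str, str]) -> Optional[str]:
    """Rule-based cleanup (lowercase, hyphens to spaces, collapse
    whitespace) followed by a dictionary lookup. Returns None if unmapped."""
    key = " ".join(mention.lower().replace("-", " ").split())
    return dictionary.get(key)


def build_triples(
    pmid: str, genes: Iterable[str], diseases: Iterable[str]
) -> List[Triple]:
    """Link one article (the whole) to each normalized entity (the parts)."""
    triples = set()  # a set removes duplicates from synonymous mentions
    for mention in genes:
        master = normalize(mention, GENE_MASTER)
        if master is not None:
            triples.add(Triple(pmid, "HAS_GENE", master))
    for mention in diseases:
        master = normalize(mention, DISEASE_MASTER)
        if master is not None:
            triples.add(Triple(pmid, "HAS_DISEASE", master))
    return sorted(triples)


# Placeholder PubMed ID and mentions, for illustration only.
example_triples = build_triples(
    "PMID:0000001",
    ["BRCA-1", "BRCA1", "p53", "XYZ9"],
    ["breast carcinoma"],
)
```

In this sketch the two spellings of BRCA1 collapse into a single triple, and the unmapped mention `XYZ9` is dropped rather than guessed at.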
The resulting SimpleGermKG contains 297 genes, 130 diseases, and 46,747 triples, which can be visualized and queried using the Neo4j graph database. The authors discuss the potential applications, limitations, and future research directions for the germline corpora knowledge graph.
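As a rough illustration of how such triples might be loaded into Neo4j, the following Python sketch builds a parameterized Cypher `MERGE` statement for one article-entity link. The node labels (`Article`, `Gene`, `Disease`), property names (`pmid`, `name`), relationship types, and the helper `merge_statement` are assumptions for illustration; the summary does not describe the schema the authors actually used.

```python
from typing import Dict, Tuple

# Hypothetical schema: node label -> relationship type from Article.
RELATION_FOR_LABEL: Dict[str, str] = {
    "Gene": "HAS_GENE",
    "Disease": "HAS_DISEASE",
}


def merge_statement(pmid: str, label: str, name: str) -> Tuple[str, Dict[str, str]]:
    """Return a Cypher query and its parameters for one article-entity link.

    Values are passed as query parameters; the label and relationship
    type are interpolated only after checking them against a fixed allowlist.
    """
    if label not in RELATION_FOR_LABEL:
        raise ValueError(f"unsupported label: {label!r}")
    relation = RELATION_FOR_LABEL[label]
    query = (
        "MERGE (a:Article {pmid: $pmid}) "
        f"MERGE (e:{label} {{name: $name}}) "
        f"MERGE (a)-[:{relation}]->(e)"
    )
    return query, {"pmid": pmid, "name": name}


# Placeholder PubMed ID, for illustration only.
query, params = merge_statement("PMID:0000001", "Gene", "BRCA1")
```

Using `MERGE` rather than `CREATE` keeps the load idempotent, so a gene or disease shared by many articles is stored as a single node. The resulting pair could then be executed with a Neo4j client, for example via `session.run(query, params)` in the official Python driver.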
Source: arxiv.org