Core Concepts
This paper proposes a system to automatically extract, standardize, and visualize disease and demographic data from electronic health records to support physician research activities.
Abstract
The paper presents a methodology to process electronic health record (EHR) data for research purposes. The key steps are:
Pre-processing:
Normalizing demographic data (age, gender, diagnosis date) into standard formats
Cleaning missing values
Annotation:
Using a machine learning-based Named Entity Recognition (NER) model to automatically identify disease names in free-text diagnosis descriptions
Comparing the NER model performance against a dictionary-based system (MetaMap)
Transformation:
Mapping the recognized disease names to the standardized ICD-10 coding system
Generating ICD-10 code, name, and category for each disease
Evaluation:
Comparing the accuracy of the NER model (81%) against the dictionary-based system (67%)
Visualization:
Presenting the standardized disease and demographic data in an interactive dashboard to support research activities
Allowing users to explore disease statistics by factors like age, gender, and time
The proposed system aims to make EHR data more usable for research by addressing challenges like data heterogeneity and lack of standardization. It provides a comprehensive approach to extract, normalize, and visualize disease-related insights from EHRs.
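A minimal sketch of the pre-processing and transformation steps described above, not the paper's implementation. The function names, the accepted date layouts, the gender vocabulary, and the age bounds are all assumptions made for illustration. The three-entry ICD-10 table is a hand-picked excerpt of three-character categories, standing in for the full code list a real system would load.

```python
from datetime import datetime
from typing import Optional

# Hypothetical gender vocabulary; real EHR exports will likely need more variants.
GENDER_MAP = {"m": "male", "male": "male", "f": "female", "female": "female"}

# Date layouts assumed for illustration only.
DATE_FORMATS = ("%Y-%m-%d", "%d/%m/%Y", "%Y/%m/%d")

# Illustrative excerpt: disease name -> (ICD-10 code, name, category).
ICD10 = {
    "pulmonary embolism": ("I26", "Pulmonary embolism",
                           "Diseases of the circulatory system"),
    "colon cancer": ("C18", "Malignant neoplasm of colon", "Neoplasms"),
    "phobic anxiety": ("F40", "Phobic anxiety disorders",
                       "Mental and behavioural disorders"),
}


def normalize_gender(raw: Optional[str]) -> Optional[str]:
    """Map free-form gender values to 'male'/'female'; unknown values become None."""
    if raw is None:
        return None
    return GENDER_MAP.get(raw.strip().lower())


def normalize_date(raw: Optional[str]) -> Optional[str]:
    """Return the diagnosis date as ISO 8601 (YYYY-MM-DD), or None if unparseable."""
    if not raw:
        return None
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date().isoformat()
        except ValueError:
            continue
    return None


def normalize_age(raw) -> Optional[int]:
    """Return age as a whole number of years, or None if missing or implausible."""
    try:
        age = int(float(str(raw).strip()))
    except (TypeError, ValueError, OverflowError):
        return None
    return age if 0 <= age <= 120 else None


def map_to_icd10(disease_name: str) -> Optional[dict]:
    """Look up a recognized disease name and return its code, name, and category."""
    entry = ICD10.get(disease_name.strip().lower())
    if entry is None:
        return None
    code, name, category = entry
    return {"code": code, "name": name, "category": category}


record = {"age": "64.0", "gender": " F ", "date": "05/03/2021"}
clean = {
    "age": normalize_age(record["age"]),
    "gender": normalize_gender(record["gender"]),
    "date": normalize_date(record["date"]),
}
print(clean)
print(map_to_icd10("Pulmonary embolism"))
```

Returning None for values that cannot be normalized keeps missing data explicit, so a later cleaning step can decide whether to drop or impute the record.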
Stats
The diagnosis text "Referred from shortness of breath/ pulmonary embolism /accepted by medical." contains the disease mentions "shortness of breath" and "pulmonary embolism".
The diagnosis text "phobic anxiety with major depressive disorder." contains the disease mentions "phobic anxiety" and "major depressive disorder".
The diagnosis text "Colon cancer for liver evaluation" contains the disease mention "Colon cancer".
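The three diagnosis texts above can be used to illustrate the simplest form of dictionary-based extraction. The sketch below is a plain term lookup, not MetaMap and not the paper's machine learning NER model. The term list and the function name `find_disease_mentions` are hypothetical, and the list is limited to the phrases quoted in these examples.

```python
import re
from typing import List, Sequence

# Hypothetical term list, restricted to the phrases quoted in the examples above.
DISEASE_TERMS = (
    "shortness of breath",
    "pulmonary embolism",
    "phobic anxiety",
    "major depressive disorder",
    "colon cancer",
)


def find_disease_mentions(text: str, terms: Sequence[str] = DISEASE_TERMS) -> List[str]:
    """Return dictionary terms found in text, in order of appearance.

    Matching is case-insensitive and respects word boundaries; the returned
    strings keep the casing used in the original text.
    """
    hits = []
    for term in terms:
        pattern = r"\b" + re.escape(term) + r"\b"
        for match in re.finditer(pattern, text, flags=re.IGNORECASE):
            hits.append((match.start(), match.group(0)))
    hits.sort()  # order by position in the text
    return [mention for _, mention in hits]


examples = [
    "Referred from shortness of breath/ pulmonary embolism /accepted by medical.",
    "phobic anxiety with major depressive disorder.",
    "Colon cancer for liver evaluation",
]
for diagnosis in examples:
    print(find_disease_mentions(diagnosis))
# ['shortness of breath', 'pulmonary embolism']
# ['phobic anxiety', 'major depressive disorder']
# ['Colon cancer']
```

An exact-match lookup like this misses misspellings, abbreviations, and unseen phrasings, which is one reason a learned NER model may outperform a dictionary-based system on free-text diagnoses.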
Quotes
"The EHRs are a valuable resource for clinical studies and research activities [3] and [4]."
"ICD provides an effective way to unify diseases data between different sources as an international classification and provides an accurate specification of the disease category."
"Only 60% ∼ 80% of the assigned ICD codes reflect the exact patient medical diagnosis [8]."