Core Concepts
Grounding medical knowledge to the appropriate anatomical regions improves the learning of domain-general representations of chest X-ray images and radiology reports, leading to performance gains in various downstream tasks.
Abstract
The paper proposes a grounded knowledge-enhanced medical vision-language pre-training (GK-MVLP) framework to improve the learning of domain-general representations of chest X-ray images and radiology reports.
Key highlights:
The framework leverages fine-grained alignment between visual information in chest X-ray images and medical knowledge by grounding the knowledge to the appropriate anatomical regions.
Medical knowledge prompts are constructed to provide instance-level abnormality location information, preventing the injection of irrelevant knowledge during the decoding stage (illustrated in the sketch below).
Experiments show that GK-MVLP outperforms or matches the state-of-the-art performance on downstream tasks such as disease classification, disease localization, report generation, and medical visual question answering.
Ablation studies demonstrate the importance of the grounding mechanism in improving cross-modality representation learning.
The proposed GK-MVLP framework addresses two key challenges in the chest X-ray domain: achieving optimal alignment between visual and textual information, and injecting only relevant medical knowledge. Together these lead to improved performance across a range of medical imaging and language tasks.
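A minimal sketch of what such grounding might look like in code is shown below. All names here (GroundedKnowledgeAligner, build_knowledge_prompt, the prompt template, and tensor shapes) are hypothetical illustrations, not the paper's actual implementation; the sketch assumes region features come from an anatomical region detector and uses standard cross-attention, masked so that each knowledge prompt can only attend to its grounded region.

```python
# Hypothetical sketch: grounding medical knowledge prompts to anatomical
# regions via masked cross-attention. Names and shapes are illustrative,
# not the paper's actual implementation.
import torch
import torch.nn as nn


def build_knowledge_prompt(abnormality: str, region: str) -> str:
    """Construct an instance-level prompt tying an abnormality to its
    anatomical location, e.g. 'pneumonia located at right lower lobe'."""
    return f"{abnormality} located at {region}"


class GroundedKnowledgeAligner(nn.Module):
    """Cross-attention that lets each knowledge-prompt embedding attend
    only to the visual features of its grounded anatomical region."""

    def __init__(self, dim: int = 256, heads: int = 4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, prompt_emb, region_feats, region_mask):
        # prompt_emb:   (B, P, D) embeddings of knowledge prompts
        # region_feats: (B, R, D) features of detected anatomical regions
        # region_mask:  (B, P, R) bool, True where a prompt must NOT attend
        #               to a region, restricting each prompt to its anatomy
        attn_mask = region_mask.repeat_interleave(self.attn.num_heads, dim=0)
        grounded, _ = self.attn(prompt_emb, region_feats, region_feats,
                                attn_mask=attn_mask)
        return self.norm(prompt_emb + grounded)


# Toy usage: 2 prompts, 4 candidate regions, each prompt tied to one region.
prompts = [build_knowledge_prompt("pneumonia", "right lower lobe"),
           build_knowledge_prompt("cardiomegaly", "cardiac silhouette")]
emb = torch.randn(1, 2, 256)      # prompt embeddings (from a text encoder)
regions = torch.randn(1, 4, 256)  # region features (from a region detector)
mask = torch.ones(1, 2, 4, dtype=torch.bool)
mask[0, 0, 1] = False             # prompt 0 grounded to region 1
mask[0, 1, 0] = False             # prompt 1 grounded to region 0
out = GroundedKnowledgeAligner()(emb, regions, mask)
print(out.shape)                  # torch.Size([1, 2, 256])
```

The masking is what distinguishes grounded injection from naive knowledge injection: a prompt never mixes with features of unrelated anatomy, which mirrors the paper's goal of keeping irrelevant knowledge out of the decoding stage.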
Stats
The MIMIC-CXR dataset contains 166,504 image-report pairs used for pre-training.
The Chest ImaGenome dataset provides anatomical region annotations used to construct medical knowledge prompts.
Downstream tasks use the following datasets (collected into a machine-readable form in the snippet after this list):
RSNA Pneumonia: 25,184 training, 1,500 validation, 3,000 testing samples
NIH ChestX-ray: 78,468 training, 11,219 validation, 22,433 testing samples
CheXpert: 218,414 training, 5,000 validation, 234 testing samples
IU X-Ray: 2,069 training, 296 validation, 590 testing samples
VQA-RAD: 3,064 training, 451 testing samples
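For convenience, the splits above can be captured in a simple configuration mapping. The counts are the ones reported above; the layout and field names are illustrative (VQA-RAD has no validation split listed, hence None):

```python
# Downstream dataset splits (sample counts as reported above).
DATASET_SPLITS = {
    "RSNA Pneumonia": {"train": 25_184,  "val": 1_500,  "test": 3_000},
    "NIH ChestX-ray": {"train": 78_468,  "val": 11_219, "test": 22_433},
    "CheXpert":       {"train": 218_414, "val": 5_000,  "test": 234},
    "IU X-Ray":       {"train": 2_069,   "val": 296,    "test": 590},
    "VQA-RAD":        {"train": 3_064,   "val": None,   "test": 451},
}
```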
Quotes
"Grounding medical knowledge with the appropriate anatomical regions permits performance gain in various chest X-ray tasks."
"Cross-modality representation learning can be improved by our proposed GK-MVLP framework which offers additional information from grounding medical knowledge with the corresponding abnormal anatomical regions."