
Clinical Prior Guided Hierarchical Vision-Language Pre-training for Medical Imaging Analysis

Core Concepts
A novel clinical prior guided hierarchical vision-language pre-training framework, IMITATE, that aligns multi-level visual features from medical images with the descriptive and conclusive textual features from hierarchical medical reports, outperforming state-of-the-art methods across various medical imaging downstream tasks.
The paper proposes a novel clinical prior guided vision-language pre-training (VLP) framework called IMITATE that addresses the challenge of aligning medical images and reports in the medical domain. Key highlights:

- Existing VLP methods often simplify medical reports into a unified entity or fragmented tokens, overlooking the inherent hierarchical structure of reports, which consist of 'Findings' for descriptive content and 'Impressions' for conclusive observations.
- IMITATE performs hierarchical alignment between multi-level visual features from medical images and the descriptive and conclusive textual features from the hierarchical medical reports.
- A new clinical-informed contrastive loss (CICL) incorporates clinical correlations among different image-report pairs during the alignment process.
- Comprehensive experiments on 6 datasets spanning 5 medical imaging downstream tasks demonstrate the superior performance of IMITATE compared to state-of-the-art methods.
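The core idea of CICL — replacing the binary contrastive target with a soft affinity matrix built from similarities among the reports themselves — can be sketched as follows. This is a minimal illustration, not the paper's exact formulation: the function name, the use of a temperature `temp`, and the way the smoothing coefficient `lam` blends the soft target with the identity are all assumptions for clarity.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def clinical_informed_contrastive_loss(img_emb, txt_emb, lam=0.5, temp=0.07):
    """Sketch of a CICL-style loss (illustrative, not the paper's exact form).

    Instead of a binary identity target, the target affinity matrix is
    derived from similarities among the text embeddings of different
    image-report pairs, then smoothed toward the identity by `lam`.
    """
    # L2-normalise both embedding sets
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temp                       # image-to-text similarities

    # Soft target: report-report similarity, blended with the identity
    target = softmax(txt @ txt.T / temp, axis=1)
    target = lam * np.eye(len(txt)) + (1 - lam) * target
    target = target / target.sum(axis=1, keepdims=True)

    # Cross-entropy between predicted and target distributions
    log_pred = np.log(softmax(logits, axis=1) + 1e-12)
    return float(-(target * log_pred).sum(axis=1).mean())
```

Note that `lam=1.0` recovers a conventional InfoNCE-style loss with a one-hot target, so the clinical-similarity term can be seen as a relaxation of standard contrastive alignment.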
Dataset statistics:

- MIMIC-CXR: 213,384 image-text pairs after preprocessing
- CheXpert: 186,027 training / 5,000 validation / 202 test samples
- RSNA: 16,010 training / 5,337 validation / 5,337 test samples
- COVIDx: 23,988 training / 5,998 validation / 400 test samples
- ChestX-ray14: 77,872 training / 8,652 validation / 25,596 test samples
- SIIM: 8,433 training / 1,807 validation / 1,807 test samples
- Object-CXR: 6,400 training / 1,600 validation / 1,000 test samples
"Conventional VLP methods align high-level visual features with the entire medical report, without distinguishing between the descriptive and conclusive sections in the report."

"We hypothesize that low-level visual features embody more descriptive properties of images corresponding to the descriptive part of the report, while high-level visual features contain more semantic information corresponding to the conclusive part of the report."

"Unlike traditional approaches that use a binary affinity matrix as the target, CICL constructs the affinity matrix based on the similarity among different image-report pairs."

Deeper Inquiries

How can the proposed hierarchical alignment strategy be extended to other modalities beyond medical images, such as radiology reports or pathology slides?

The hierarchical alignment strategy in the IMITATE framework can be extended to other modalities by adapting the alignment process to the specific characteristics of each data type.

For radiology reports, which contain textual descriptions of medical findings and impressions, a similar hierarchical alignment can be applied: the text can be split into sections such as findings, impressions, and recommendations, and each section aligned with corresponding visual features extracted from related images or charts. Incorporating the hierarchical structure of the reports lets the model associate specific text segments with relevant visual content, improving overall vision-language understanding.

For pathology slides, which are high-resolution images of tissue samples, the strategy can be tailored to align different levels of visual features with textual descriptions provided by pathologists. As with medical reports, the text can be segmented into sections such as observations, diagnoses, and treatment recommendations and aligned with visual features representing different levels of detail in the slides. This helps the model link specific regions or characteristics in the images with the corresponding textual information, enhancing the interpretation and analysis of pathology data.
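A prerequisite for any such hierarchical alignment is parsing the free-text report into its sections. A minimal sketch is shown below; the section header names and the simple regex are assumptions for illustration, and real-world reports typically need more robust parsing.

```python
import re

# Matches section headers like "FINDINGS:" or "Impression:" at line start.
# The header vocabulary is an assumption; extend it for other report styles.
SECTION_RE = re.compile(r'(?im)^\s*(findings|impression)\s*:\s*')

def split_report(report):
    """Partition a radiology report into descriptive ('findings') and
    conclusive ('impression') sections, keyed by lower-cased header name."""
    parts = SECTION_RE.split(report)
    # re.split with a capturing group yields:
    # [preamble, header1, body1, header2, body2, ...]
    return {name.lower(): body.strip()
            for name, body in zip(parts[1::2], parts[2::2])}
```

For example, `split_report("INDICATION: Cough.\nFINDINGS: Lungs are clear.\nIMPRESSION: No acute disease.")` yields separate `findings` and `impression` entries, which can then be encoded and aligned with low-level and high-level visual features respectively.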

What are the potential limitations of the clinical-informed contrastive loss, and how can it be further improved to better capture the nuances of medical data?

The clinical-informed contrastive loss used in the IMITATE framework has several potential limitations that could reduce its ability to capture the nuances of medical data. One limitation is its reliance on empirical correlation matrices derived from the text embeddings, which may not fully capture the complex relationships between different patients' visual and textual features. Additionally, the choice of the regularization coefficient λ in the smoothed correlation matrix affects the model's sensitivity to clinical similarities and correlations in the data.

Several enhancements could address these limitations. One approach is to incorporate domain-specific knowledge or expert annotations to refine the correlation matrices so that they more accurately reflect the clinical context. Exploring different regularization techniques, or adaptive mechanisms that adjust λ based on the data distribution, could further optimize the contrastive loss for alignment. Finally, integrating additional clinical features or metadata into the loss function could provide more context and improve the model's ability to capture subtle clinical nuances.
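The sensitivity to λ mentioned above can be made concrete with a small numerical sketch. The blending formula below is an assumed parameterisation (λ toward 1 recovers the conventional binary one-hot target; smaller λ retains off-diagonal clinical similarity), not the paper's exact equation.

```python
import numpy as np

def smoothed_target(txt_sim, lam):
    """Blend a row-normalised text-similarity affinity matrix with the
    identity target. `lam=1.0` gives the conventional binary target;
    this parameterisation is an assumption for illustration."""
    n = txt_sim.shape[0]
    soft = txt_sim / txt_sim.sum(axis=1, keepdims=True)  # rows sum to 1
    return lam * np.eye(n) + (1 - lam) * soft

# Toy report-report similarities: patients 0 and 1 are clinically similar.
sim = np.array([[1.0, 0.8, 0.1],
                [0.8, 1.0, 0.2],
                [0.1, 0.2, 1.0]])

print(np.round(smoothed_target(sim, 1.0), 2))  # pure identity target
print(np.round(smoothed_target(sim, 0.5), 2))  # off-diagonal similarity kept
```

A λ too close to 1 discards the clinical-similarity signal entirely, while a λ too close to 0 lets noisy text-embedding correlations dominate the target, which is why adaptive tuning of λ is a plausible improvement.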

Given the success of IMITATE in medical imaging tasks, how can the insights from this work be leveraged to enhance vision-language understanding in other healthcare domains, such as drug discovery or clinical decision support systems?

The success of IMITATE in medical imaging tasks can be leveraged to enhance vision-language understanding in other healthcare domains, such as drug discovery and clinical decision support systems. By applying the hierarchical alignment strategy and clinical-informed contrastive loss to diverse healthcare data modalities, a model can learn to link visual and textual information effectively across contexts.

In drug discovery, a model could be trained on molecular structures, chemical compounds, and associated text descriptions to learn the relationships between molecular features and drug properties. This could aid in predicting drug efficacy, side effects, and interactions from visual representations of molecular structures paired with textual descriptions of drug characteristics.

For clinical decision support systems, the approach could be applied to medical records, patient data, and treatment guidelines to assist healthcare professionals in making informed decisions. By aligning visual features from medical images with textual information from patient records and clinical guidelines, the model could provide personalized recommendations, diagnostic insights, and treatment suggestions grounded in a comprehensive understanding of the patient's condition and medical history.