Core Concepts
Synthetic medical images generated from real radiology reports can effectively substitute for real images in vision-language pre-training, achieving comparable or even superior performance on downstream medical vision tasks.
Abstract
The paper investigates the feasibility and effectiveness of using synthetic medical images for vision-language pre-training (VLP) in the medical domain. The authors employ two text-guided image generation models, Stable Diffusion (SD) and RoentGen, to generate synthetic chest X-ray (CXR) images conditioned on real radiology reports from the MIMIC-CXR dataset.
The authors then pre-train three state-of-the-art VLP methods (ConVIRT, GLoRIA, MGCA) exclusively on the synthetic image-text pairs and evaluate their performance on three downstream tasks: image classification, semantic segmentation, and object detection.
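At their core, ConVIRT-style VLP methods train with a symmetric contrastive (InfoNCE) objective that pulls each image embedding toward the embedding of its paired report and pushes it away from the other reports in the batch. A minimal NumPy sketch of that objective (the function name and temperature value are illustrative assumptions, not taken from the paper):

```python
import numpy as np

def info_nce_loss(img_emb, txt_emb, temperature=0.1):
    """Symmetric InfoNCE loss over a batch of paired embeddings.

    img_emb, txt_emb: (N, D) arrays; row i of each forms a matched
    image-report pair. Off-diagonal rows serve as in-batch negatives.
    """
    # L2-normalize so the dot product is cosine similarity.
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature  # (N, N) similarity matrix

    def xent_diagonal(l):
        # Cross-entropy with the matched pair (diagonal) as the target.
        l = l - l.max(axis=1, keepdims=True)  # numerical stability
        log_softmax = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(log_softmax))

    # Average the image-to-text and text-to-image directions.
    return 0.5 * (xent_diagonal(logits) + xent_diagonal(logits.T))
```

The key point for this paper is that the text side of each pair is always a real report; only the image side is swapped for a synthetic sample.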
The key findings are:
- VLP models pre-trained on synthetic images from the domain-specific RoentGen model perform on par with, or even better than, those pre-trained on real image-text pairs.
- In contrast, models pre-trained on synthetic images from the generic Stable Diffusion model show a significant decline in performance across all tasks.
- More granular vision-language alignment (e.g., in GLoRIA and MGCA) yields better performance on the synthetic dataset, suggesting that the synthetic images capture rich localized information that can be effectively aligned with the real reports.
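The "more granular alignment" in the last finding refers to objectives that match individual report words to image sub-regions rather than matching whole images to whole reports. A simplified NumPy sketch of the word-to-region attention step such methods build on (shapes, names, and the temperature are illustrative assumptions, not the paper's exact formulation):

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def word_region_context(word_emb, region_emb, temperature=0.1):
    """Attention-weighted region summary for each report word.

    word_emb:   (W, D) embeddings of report words
    region_emb: (R, D) embeddings of image sub-regions
                (e.g. cells of a CNN feature map)
    Returns a (W, D) array of word-conditioned region contexts,
    which a local contrastive loss then compares to the words.
    """
    sim = word_emb @ region_emb.T              # (W, R) word-region similarity
    attn = softmax(sim / temperature, axis=1)  # attend over regions per word
    return attn @ region_emb                   # (W, D) context vectors
```

The paper's observation is that this kind of localized matching still works when the regions come from RoentGen's synthetic images, which is evidence that those images contain anatomically meaningful local structure.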
The authors conclude that domain-specific generative models have significant potential to generate realistic synthetic medical data for VLP, addressing data scarcity and enabling new ways to share multi-modal medical datasets while mitigating privacy concerns.
Stats
The pre-training dataset consists of 213,384 synthetic CXR images paired with real medical reports from the MIMIC-CXR dataset.
The downstream evaluation is performed on three image classification datasets (CheXpert, RSNA, COVIDx), two semantic segmentation datasets (SIIM, RSNA), and two object detection datasets (RSNA, Object-CXR).
Quotes
"Interestingly, methods pre-trained on synthetic images from RoentGen exhibit performance that is comparable to, or even surpasses, those pre-trained on real images for downstream tasks."
"However, variants pre-trained on synthetic images from SD show a significant decline in performance across all visual tasks."