Key Concepts
The paper proposes a zero-shot method for medical phrase grounding that uses the cross-attention mechanisms of a pre-trained Latent Diffusion Model to extract heatmaps marking regions of image-text alignment, without any further fine-tuning.
Summary
The paper presents a novel approach for performing medical phrase grounding in a zero-shot setting using a pre-trained Latent Diffusion Model (LDM). The key idea is to leverage the cross-attention mechanisms within the LDM, which inherently align visual and textual features, to extract heatmaps that indicate the regions where the input image and text prompt are maximally aligned.
The authors first provide an overview of the LDM architecture and its ability to incorporate external information, such as text, into the model. They then describe their proposed pipeline for zero-shot phrase grounding, which involves:
1. Gathering the attention maps from multiple cross-attention blocks and timesteps of the reverse diffusion process within the LDM.
2. Aggregating these attention maps by averaging to obtain a single activation heatmap that matches the spatial dimensions of the input image.
3. Applying additional post-processing techniques, such as binary Otsu thresholding, to refine the generated heatmap without introducing any learnable parameters.
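The aggregation and thresholding steps above can be illustrated with a minimal NumPy sketch. It assumes the cross-attention maps have already been extracted from the LDM as square 2-D arrays (one per block and timestep); the function names, the nearest-neighbour upsampling, and the min-max normalisation are illustrative assumptions, not details confirmed by the paper.

```python
import numpy as np

def upsample_nearest(attn, size):
    """Nearest-neighbour upsampling of a 2-D map to (size, size)."""
    h, w = attn.shape
    rows = np.arange(size) * h // size
    cols = np.arange(size) * w // size
    return attn[np.ix_(rows, cols)]

def aggregate_attention(maps, size):
    """Bring every map to a common resolution, then average into one heatmap."""
    stacked = np.stack([upsample_nearest(m, size) for m in maps])
    heat = stacked.mean(axis=0)
    # Min-max normalise to [0, 1] (an assumption made for this sketch).
    span = heat.max() - heat.min()
    return (heat - heat.min()) / span if span > 0 else np.zeros_like(heat)

def otsu_threshold(heat, bins=256):
    """Threshold that maximises the between-class variance (Otsu's method)."""
    hist, edges = np.histogram(heat, bins=bins, range=(0.0, 1.0))
    hist = hist.astype(float)
    centers = (edges[:-1] + edges[1:]) / 2
    w0 = np.cumsum(hist)                 # pixel count at or below each bin
    w1 = hist.sum() - w0                 # pixel count above each bin
    cum_mean = np.cumsum(hist * centers)
    m0 = np.divide(cum_mean, w0, out=np.zeros(bins), where=w0 > 0)
    m1 = np.divide(cum_mean[-1] - cum_mean, w1, out=np.zeros(bins), where=w1 > 0)
    between = w0 * w1 * (m0 - m1) ** 2
    return float(centers[np.argmax(between)])

# Toy example: two maps at different resolutions that both highlight
# the top-left quadrant of the image.
maps = [np.zeros((4, 4)), np.zeros((8, 8))]
maps[0][:2, :2] = 1.0
maps[1][:4, :4] = 1.0
heat = aggregate_attention(maps, size=16)
mask = heat > otsu_threshold(heat)       # binary grounding mask
```

Neither step has trainable parameters, which is consistent with the zero-shot setting described above.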
The authors evaluate their method on the MS-CXR benchmark dataset and compare it against state-of-the-art discriminative baselines that are trained in a fully supervised or self-supervised manner. The results show that their proposed zero-shot approach is competitive with the baselines and even outperforms them on average in terms of mean IoU and AUC-ROC metrics.
The authors also provide an ablation study to justify their choices of cross-attention layers and timesteps, as well as the impact of post-processing. Additionally, they present a qualitative analysis to highlight the strengths and limitations of their method compared to the strongest baseline.
Overall, the paper demonstrates the potential of leveraging pre-trained generative models, such as the LDM, for downstream tasks in the medical imaging domain, without the need for any further fine-tuning.
Statistics
The average mean IoU across 8 pathology classes is 22.6%.
The average AUC-ROC across 8 pathology classes is 73.8%.
The average |CNR| (contrast-to-noise ratio) across 8 pathology classes is 0.92.
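To make two of these metrics concrete, the sketch below computes IoU between a predicted binary mask and a ground-truth region, and |CNR| between heatmap values inside and outside that region. The CNR formula used here, |mean_in - mean_out| / sqrt(var_in + var_out), is a commonly used definition and is an assumption; this summary does not state the exact formula the authors use. Function names are hypothetical.

```python
import numpy as np

def iou(pred_mask, gt_mask):
    """Intersection over union of two binary masks."""
    pred = np.asarray(pred_mask, dtype=bool)
    gt = np.asarray(gt_mask, dtype=bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 0.0  # both masks empty; 0.0 is a convention chosen for this sketch
    return float(np.logical_and(pred, gt).sum() / union)

def abs_cnr(heatmap, gt_mask):
    """Absolute contrast-to-noise ratio between the inside and outside of gt_mask."""
    gt = np.asarray(gt_mask, dtype=bool)
    inside, outside = heatmap[gt], heatmap[~gt]
    denom = np.sqrt(inside.var() + outside.var())
    if denom == 0:
        return 0.0  # undefined when both regions are constant; convention for this sketch
    return float(abs(inside.mean() - outside.mean()) / denom)

# Toy 4x4 example: the ground truth is the top-left 2x2 block.
gt = np.zeros((4, 4), dtype=bool)
gt[:2, :2] = True

pred = np.zeros((4, 4), dtype=bool)
pred[:2, :] = True                       # prediction covers the top two rows

heat = np.zeros((4, 4))
heat[:2, :2] = [[0.8, 1.0], [0.8, 1.0]]  # high activation inside the region
heat[2:, :2] = 0.2                       # mild activation outside
heat[2:, 2:] = [[0.2, 0.0], [0.2, 0.0]]

score_iou = iou(pred, gt)
score_cnr = abs_cnr(heat, gt)
```

Unlike IoU, |CNR| is computed on the raw heatmap, so it does not depend on the thresholding step.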
Quotes
"Our proposed framework might be used to automatically link reports to the relevant image locations, allowing fast inclusion of key images and easy navigation when reviewing a previous scan."
"We might also extend to the task of diagnosis by creating text prompts such as 'Where is {pathology label}?', to achieve an off-the-shelf detector that provides automated diagnosis."