Core Concepts
Existing explainable AI (XAI) methods are limited in their ability to explain the inner workings of complex medical vision-language models (VLMs) such as MedCLIP. The paper proposes a novel approach that overcomes these limitations by applying XAI methods to the individual text and image embeddings and then combining the results, so that the explanation reflects the interaction between the two modalities.
Abstract
The paper analyzes the performance of various XAI methods, including gradient backpropagation, occlusion, integrated gradients, and Grad-Shapley, in explaining the inner workings of the MedCLIP VLM. The authors find that these conventional XAI methods exhibit significant limitations when applied to VLMs, often highlighting irrelevant regions of the input image and failing to capture the nuanced interplay between text and visual features.
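For concreteness, the sketch below shows how one such conventional method, plain gradient backpropagation, would typically be applied to a CLIP-style image-text similarity score. The handles `image_encoder` and `text_embedding` are hypothetical stand-ins, not MedCLIP's actual API; this is a minimal illustration of the baseline setting the authors critique, not their exact pipeline.

```python
import torch

# Minimal sketch: gradient-backpropagation saliency on a CLIP-style
# image-text similarity score. `image_encoder` and `text_embedding`
# are hypothetical stand-ins for MedCLIP's actual interfaces.
def gradient_saliency(image, text_embedding, image_encoder):
    image = image.clone().requires_grad_(True)        # (1, C, H, W)
    image_embedding = image_encoder(image)             # (1, D)
    score = torch.cosine_similarity(image_embedding, text_embedding, dim=-1)
    score.backward()
    # Per-pixel importance: gradient magnitude, reduced over the channel axis.
    return image.grad.abs().max(dim=1).values          # (1, H, W)
```

Because the attribution is taken on a single scalar similarity, explanations of this kind tend to smear importance over large image regions, which is the failure mode described above.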
To address these shortcomings, the authors propose a new approach that applies the XAI methods to the individual embedding spaces of the VLM (i.e., the outputs of the image and text encoders) and then combines the resulting explainability maps using a dot product. This captures the influence of both input modalities on the final model prediction, providing more comprehensive and meaningful insight into the VLM's decision-making process.
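One plausible reading of that combination step is sketched below, assuming the per-modality explanation takes the form of one saliency map per embedding dimension. Again, `image_encoder` and `text_encoder` are hypothetical stand-ins, and the paper's actual attribution method may differ from the simple Jacobian used here.

```python
import torch

def per_dimension_image_saliency(image, image_encoder):
    """One saliency map per image-embedding dimension (a simplified,
    hypothetical stand-in for the paper's per-modality attribution step)."""
    image = image.clone().requires_grad_(True)                # (1, C, H, W)
    emb = image_encoder(image).squeeze(0)                      # (D,)
    maps = []
    for d in range(emb.shape[0]):
        grad = torch.autograd.grad(emb[d], image, retain_graph=True)[0]
        maps.append(grad.abs().sum(dim=1))                     # (1, H, W)
    return torch.stack(maps, dim=0).squeeze(1)                 # (D, H, W)

def combined_explanation(image, text, image_encoder, text_encoder):
    """Combine the per-dimension image maps with the text embedding via a
    dot product over the shared embedding dimension."""
    image_maps = per_dimension_image_saliency(image, image_encoder)  # (D, H, W)
    text_emb = text_encoder(text).squeeze(0)                         # (D,)
    # Dot product across the embedding dimension yields one joint map.
    return torch.einsum('d,dhw->hw', text_emb, image_maps)           # (H, W)
```

Weighting each image-space map by the corresponding text-embedding coordinate is what allows the explanation to shift with the text input, which is the behaviour the evaluation emphasizes.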
The proposed approach is evaluated using the MIMIC-CXR dataset, and the results demonstrate its superiority over the conventional XAI methods. The explainability maps generated by the new method are focused and closely aligned with established medical diagnostic practices, highlighting the specific image regions that are crucial for the given text prompt or class label. This contrasts with the broad, non-specific highlights produced by the conventional XAI methods.
The authors also investigate the impact of different text inputs (prompts vs. class labels) on the VLM's performance and the corresponding explainability maps, further highlighting the versatility and effectiveness of the proposed approach.
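To make the two text-input styles concrete, the strings below are hypothetical examples (not taken from the paper) of a sentence prompt versus a bare class label for the same pathology.

```python
# Hypothetical examples of the two text-input styles being compared:
# a full sentence prompt versus a bare class label for the same finding.
text_prompt = "There is a moderate right-sided pleural effusion."
class_label = "Pleural effusion"

# Either string would be passed through the text encoder and combined with the
# image attributions as sketched above, so the resulting maps can be compared.
```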
Stats
The MIMIC-CXR dataset contains approximately 377,110 chest X-ray (CXR) and radiology report pairs.
The authors used a subset of 2,000 randomly selected samples from this dataset for their analysis.
Quotes
"Our proposed approach produces the feature maps depicted in Figure 2. We generated explainability maps using both text prompts (sentences) and class labels to investigate the influence of different text inputs."
"The highlighted pixel locations closely align with established clinical diagnostic procedures for the specified pathology. Additionally, our method effectively illustrates how MedCLIP's focus shifts based on the input text prompt, providing strong evidence of this VLMs' capacity to comprehend text and identify relevant image pixels."