Quantifying Grounding Capabilities of Vision-Language Models Using GradCAM Activations
The paper introduces a suite of quantitative metrics based on GradCAM activations for evaluating the grounding capabilities of pre-trained vision-language models such as CLIP, BLIP, and ALBEF. These metrics offer an explainable, quantifiable basis for a more detailed comparison of the zero-shot grounding abilities of these models. A minimal sketch of the general idea follows.
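The sketch below illustrates the general recipe, not the paper's exact metrics: compute a GradCAM heatmap over the visual feature map with respect to an image-text similarity score, then score grounding as the fraction of heatmap mass falling inside an annotated region. The feature map, text embedding, and box coordinates here are stand-ins; a real evaluation would hook into the actual model (e.g., CLIP, BLIP, or ALBEF) and use dataset annotations.

```python
# Hedged sketch: GradCAM-based grounding proxy for a vision-language model.
# Assumptions: a PyTorch model exposing a spatial feature map (1, C, H, W)
# and a scalar image-text similarity score; the "fraction of CAM mass inside
# the ground-truth box" metric is illustrative, not the paper's definition.
import torch
import torch.nn.functional as F


def gradcam_map(feature_map: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    """GradCAM heatmap from a spatial feature map and a scalar similarity score."""
    grads, = torch.autograd.grad(score, feature_map, retain_graph=True)
    weights = grads.mean(dim=(2, 3), keepdim=True)        # per-channel importance
    cam = F.relu((weights * feature_map).sum(dim=1))       # (1, H, W) activation map
    cam = cam / (cam.sum() + 1e-8)                         # normalize to a distribution
    return cam.squeeze(0)


def grounding_score(cam: torch.Tensor, box: tuple) -> float:
    """Fraction of GradCAM mass inside a ground-truth box (x0, y0, x1, y1),
    given in heatmap coordinates."""
    x0, y0, x1, y1 = box
    return cam[y0:y1, x0:x1].sum().item()


# Toy usage: random tensors stand in for the model's visual features and a
# text embedding; similarity is a simple dot product with pooled features.
feats = torch.randn(1, 64, 7, 7, requires_grad=True)
text_emb = torch.randn(64)
similarity = (feats.mean(dim=(2, 3)).squeeze(0) * text_emb).sum()

cam = gradcam_map(feats, similarity)
print(f"grounding score: {grounding_score(cam, (2, 2, 5, 5)):.3f}")
```

A score near 1.0 would indicate that the model's attention, as seen through GradCAM, concentrates on the annotated region for the given text, which is the kind of zero-shot grounding behavior the metrics aim to quantify.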