Quantifying Grounding Capabilities of Vision-Language Models Using GradCAM Activations
The paper introduces a suite of quantitative metrics based on GradCAM activations for evaluating the grounding capabilities of pre-trained vision-language models such as CLIP, BLIP, and ALBEF. These metrics offer an explainable, quantifiable basis for a more detailed comparison of the zero-shot grounding abilities of these models. A minimal sketch of the general idea follows.
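The sketch below illustrates the general recipe, not the paper's exact metrics: compute a GradCAM heatmap over the visual feature map with respect to an image-text similarity score, then score grounding as the fraction of heatmap mass falling inside an annotated region. The feature map, text embedding, and box coordinates here are stand-ins; a real evaluation would hook into the actual model (e.g., CLIP, BLIP, or ALBEF) and use dataset annotations.

```python
# Hedged sketch: GradCAM-based grounding proxy for a vision-language model.
# Assumptions: a PyTorch model exposing a spatial feature map (1, C, H, W)
# and a scalar image-text similarity score; the "fraction of CAM mass inside
# the ground-truth box" metric is illustrative, not the paper's definition.
import torch
import torch.nn.functional as F


def gradcam_map(feature_map: torch.Tensor, score: torch.Tensor) -> torch.Tensor:
    """GradCAM heatmap from a spatial feature map and a scalar similarity score."""
    grads, = torch.autograd.grad(score, feature_map, retain_graph=True)
    weights = grads.mean(dim=(2, 3), keepdim=True)        # per-channel importance
    cam = F.relu((weights * feature_map).sum(dim=1))       # (1, H, W) activation map
    cam = cam / (cam.sum() + 1e-8)                         # normalize to a distribution
    return cam.squeeze(0)


def grounding_score(cam: torch.Tensor, box: tuple) -> float:
    """Fraction of GradCAM mass inside a ground-truth box (x0, y0, x1, y1),
    given in heatmap coordinates."""
    x0, y0, x1, y1 = box
    return cam[y0:y1, x0:x1].sum().item()


# Toy usage: random tensors stand in for the model's visual features and a
# text embedding; similarity is a simple dot product with pooled features.
feats = torch.randn(1, 64, 7, 7, requires_grad=True)
text_emb = torch.randn(64)
similarity = (feats.mean(dim=(2, 3)).squeeze(0) * text_emb).sum()

cam = gradcam_map(feats, similarity)
print(f"grounding score: {grounding_score(cam, (2, 2, 5, 5)):.3f}")
```

A score near 1.0 would indicate that the model's attention, as seen through GradCAM, concentrates on the annotated region for the given text, which is the kind of zero-shot grounding behavior the metrics aim to quantify.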