Core Concepts
Fine-tuned vision-language models (VLMs) often learn spurious correlations between irrelevant image features and textual attributes, which degrades zero-shot classification performance. RAVL, a region-aware learning approach, addresses this by discovering and mitigating these spurious correlations at the level of local image regions rather than the global image, improving VLM robustness and accuracy.
Varma, M., Delbrouck, J.-B., Chen, Z., Chaudhari, A., & Langlotz, C. (2024). RAVL: Discovering and Mitigating Spurious Correlations in Fine-Tuned Vision-Language Models. Advances in Neural Information Processing Systems, 37.
RAVL operates in two stages. Given a fine-tuned VLM, it first discovers spurious correlations by using region-level clustering to identify the precise image features contributing to zero-shot classification errors. It then mitigates the identified correlations during fine-tuning with a region-aware loss function that encourages the VLM to focus on relevant regions and ignore spurious relationships.
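To make the region-aware idea concrete, here is a minimal PyTorch sketch of the mitigation step, not the authors' implementation. It assumes region features and per-region spuriousness scores already exist (i.e., the discovery stage has run); `region_aware_image_embedding` and `contrastive_loss` are hypothetical helpers illustrating how spurious regions could be downweighted before the image-text contrastive objective.

```python
# A minimal sketch in the spirit of RAVL's region-aware loss, under the
# assumptions stated above. All function names are illustrative.

import torch
import torch.nn.functional as F


def region_aware_image_embedding(region_feats, spurious_scores):
    """Pool region features into one image embedding, suppressing regions
    flagged as spurious by a prior discovery stage.

    region_feats:    (num_regions, dim) region embeddings.
    spurious_scores: (num_regions,) in [0, 1]; higher = more spurious.
    """
    # Low spuriousness -> high pooling weight.
    weights = torch.softmax(-spurious_scores, dim=0)
    pooled = (weights.unsqueeze(-1) * region_feats).sum(dim=0)
    return F.normalize(pooled, dim=-1)


def contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Standard CLIP-style symmetric InfoNCE over a batch of image/text pairs."""
    logits = image_embs @ text_embs.t() / temperature
    targets = torch.arange(len(image_embs))
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


# Toy usage: a batch of 4 images, each with 6 candidate regions of dim 32.
torch.manual_seed(0)
batch_regions = torch.randn(4, 6, 32)
batch_scores = torch.rand(4, 6)  # from a hypothetical discovery stage
text_embs = F.normalize(torch.randn(4, 32), dim=-1)

image_embs = torch.stack([
    region_aware_image_embedding(r, s)
    for r, s in zip(batch_regions, batch_scores)
])
print(f"region-aware contrastive loss: {contrastive_loss(image_embs, text_embs).item():.4f}")
```

The point this sketch captures is that suppression happens at the region level before pooling, so the contrastive objective is not rewarded for alignment driven by spurious image regions.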