Core Concepts
LVLM-Interpret is a novel interactive application designed to enhance the interpretability of large vision-language models by providing insight into their internal mechanisms, including image patch importance, attention patterns, and causal relationships.
Abstract
The content presents LVLM-Interpret, an interpretability tool for large vision-language models (LVLMs). The key highlights are:
LVLM-Interpret is designed to help users understand the internal workings of LVLMs, which are becoming increasingly popular yet remain difficult to interpret.
The tool offers multiple interpretability functions, including:
Raw attention visualization to show how the model attends to image patches and text tokens (see the first sketch after this list).
Relevancy maps to identify the parts of the input image most relevant to the generated output (see the second sketch after this list).
Causal interpretation using the CLEANN method, which explains the model's reasoning by identifying a minimal set of input tokens that influence a specific output token (see the third sketch after this list).
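To make the raw-attention view concrete, the sketch below shows one way to turn the attention weights returned by a HuggingFace LLaVA-style model into a heatmap over image patches. The function name, the 24x24 patch grid (LLaVA-v1.5 at 336px resolution), and the layer/head averaging scheme are illustrative assumptions, not LVLM-Interpret's own implementation.

```python
# Illustrative sketch (not LVLM-Interpret's own code): map raw attention weights
# from a LLaVA-style model onto the image patch grid.
import torch

def attention_to_patch_heatmap(attentions, image_token_mask, query_index, grid=(24, 24)):
    """Average attention over layers and heads, take the row for one query token,
    keep only the columns that correspond to image tokens, and reshape to a grid."""
    stacked = torch.stack(attentions)        # (layers, batch, heads, seq, seq)
    mean_attn = stacked.mean(dim=(0, 2))[0]  # average layers and heads -> (seq, seq)
    row = mean_attn[query_index]             # attention paid by the chosen token
    patch_scores = row[image_token_mask]     # restrict to image-token columns
    return patch_scores.reshape(grid)        # heatmap over image patches
```

Here `attentions` is the tuple returned by a forward pass with `output_attentions=True` (see the loading sketch under Stats), and `image_token_mask` marks which input positions hold image tokens; with recent transformers releases that expand the image placeholder in the processor, it can often be built as `inputs["input_ids"][0] == model.config.image_token_index`, though this detail varies across versions.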
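The relevancy maps build on gradient-weighted attention; the sketch below is a simplified rollout in the spirit of Chefer et al., not the exact procedure used by LVLM-Interpret. `attention_grads` is assumed to hold the gradient of each layer's attention tensor with respect to the logit of the chosen output token, obtained for example by calling `retain_grad()` on each attention tensor before backpropagating that logit.

```python
# Hedged sketch of a relevancy-style map: gradient-weighted attention rollout.
import torch

def relevancy_rollout(attentions, attention_grads):
    """Accumulate per-layer relevancy R <- R + mean_heads(relu(grad * attn)) @ R,
    starting from the identity, as in gradient-weighted attention rollout."""
    seq_len = attentions[0].shape[-1]
    relevancy = torch.eye(seq_len)
    for attn, grad in zip(attentions, attention_grads):
        cam = torch.relu(grad * attn).mean(dim=1)[0]  # head-averaged (seq, seq) map
        relevancy = relevancy + cam @ relevancy       # propagate through the layer
    return relevancy  # relevancy[i, j]: contribution of token j to token i
```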
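As a rough illustration of the minimal-explaining-set idea behind the causal view, the sketch below runs a greedy ablation search. This is explicitly not the CLEANN algorithm, and `predict_token` is a hypothetical callback that re-runs the model on a subset of input token positions and returns the predicted token.

```python
# Hedged illustration of a minimal explaining set via greedy ablation (NOT CLEANN).
def greedy_minimal_set(input_ids, target_token, predict_token):
    """Greedily drop tokens whose removal leaves the predicted target token
    unchanged; what remains is a small (not necessarily minimal) explaining set."""
    keep = list(range(len(input_ids)))
    for idx in list(keep):                          # iterate over a frozen copy
        trial = [i for i in keep if i != idx]
        if predict_token(trial) == target_token:    # prediction survives the ablation,
            keep = trial                            # so this token is not needed
    return keep                                     # indices of the explaining tokens
```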
The authors present a case study using the LLaVA model on the Multimodal Visual Patterns (MMVP) dataset, highlighting instances where the model prioritizes the text input over the image content, leading to inconsistent responses. They also show examples where the model's accuracy is unaffected by text variations because the output is more strongly grounded in the image.
The authors conclude by discussing future directions, including consolidating the multiple interpretability methods into a more comprehensive metric that explains the reasoning behind model responses.
Stats
The LLaVA-v1.5-7b model was used in the case study.
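For readers who want to reproduce a similar setup, the sketch below loads the publicly released llava-hf checkpoint of LLaVA-v1.5-7b from the Hugging Face Hub and runs one forward pass that returns the attention tensors used in the sketches above. The repository id, prompt template, image path, and question are assumptions about the public checkpoint, not details taken from the paper.

```python
# Sketch: load LLaVA-v1.5-7b and run one forward pass with attention outputs.
# Repo id, prompt format, and the example image/question are illustrative assumptions.
import torch
from PIL import Image
from transformers import AutoProcessor, LlavaForConditionalGeneration

model_id = "llava-hf/llava-1.5-7b-hf"
processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(model_id)

image = Image.open("example.jpg")        # placeholder image path
prompt = "USER: <image>\nWhat is shown in this image? ASSISTANT:"
inputs = processor(images=image, text=prompt, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)  # per-layer attention maps
attentions = outputs.attentions  # tuple of tensors, each (batch, heads, seq, seq)
```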