Core Concepts
Pre-trained vision-language models can learn and encode a diverse set of generic and visually discriminative concepts that can be effectively utilized for interpretable and generalizable visual recognition.
Summary
The paper investigates whether pre-trained vision-language models (VLMs) can learn and encode primitive visual concepts, such as color, texture, and shape, that can be leveraged for interpretable and generalizable visual recognition.
The authors first observe that previous works on extracting visual concepts from VLMs often rely on category-specific prompts, which can create "shortcuts" and bias conclusions about the models' true concept-learning capabilities. To address this, the authors propose a Concept Discovery and Learning (CDL) framework that discovers category-agnostic and visually discriminative concepts from a large image-caption dataset, leveraging both the visual knowledge encoded in pre-trained VLMs and the linguistic knowledge of large language models (LLMs).
The key steps of the CDL framework are:
1. Extracting a preliminary list of generic visual concepts from image captions using dependency parsing and LLM-generated prompts (see the first sketch after this list).
2. Ranking and selecting the most visually discriminative concepts by measuring the mutual information between the VLM's visual recognition scores and the LLM's language-based concept relevance (second sketch below).
3. Fine-tuning the last layers of the pre-trained VLM in a self-supervised manner so that its concept activations align with the selected concepts (third sketch below).
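To make step 1 concrete, here is a minimal sketch of mining candidate concepts from captions with dependency parsing, assuming spaCy's English pipeline. The toy captions and the rule of keeping adjectival modifiers are illustrative assumptions, not the paper's exact extraction procedure.

```python
# Step 1 (sketch): mine candidate visual concepts from image captions.
import spacy
from collections import Counter

nlp = spacy.load("en_core_web_sm")

captions = [
    "A brown, spiky durian on a wooden table.",
    "A ripe yellow banana with a smooth peel.",
]

concept_counts = Counter()
for doc in nlp.pipe(captions):
    for token in doc:
        # Adjectival modifiers ("brown", "spiky") describe primitive visual
        # attributes of their head noun; keep them as candidate concepts.
        if token.dep_ == "amod":
            concept_counts[token.lemma_.lower()] += 1

# Frequent, caption-derived candidates go on to the ranking step.
candidates = [c for c, _ in concept_counts.most_common()]
print(candidates)  # e.g. ['brown', 'spiky', 'wooden', 'ripe', 'yellow', 'smooth']
```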
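For step 2, the sketch below scores each candidate by the agreement between CLIP's image-concept similarities and an LLM relevance table. The CLIP calls via transformers are real; the image paths and the hard-coded `llm_relevance` table are assumptions, and the agreement score is a simplified stand-in for the paper's mutual-information measure.

```python
# Step 2 (sketch): rank concepts by agreement between visual and language signals.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

concepts = ["brown", "spiky", "smooth", "yellow"]
images = [Image.open(p) for p in ["durian.jpg", "banana.jpg"]]  # assumed local files

inputs = processor(text=[f"a photo of something {c}" for c in concepts],
                   images=images, return_tensors="pt", padding=True)
with torch.no_grad():
    sims = model(**inputs).logits_per_image.softmax(dim=-1)  # (image, concept)

# Assumed LLM output: relevance of each concept to each image's category,
# e.g. from prompting "Is a durian typically spiky?" (1 = yes, 0 = no).
llm_relevance = torch.tensor([[1.0, 1.0, 0.0, 0.0],   # durian: brown, spiky
                              [0.0, 0.0, 1.0, 1.0]])  # banana: smooth, yellow

# Reward concepts whose visual scores fire where the LLM says they should,
# penalize firing elsewhere; keep the highest-scoring concepts.
agreement = (sims * llm_relevance - sims * (1 - llm_relevance)).sum(dim=0)
ranked = [concepts[i] for i in agreement.argsort(descending=True).tolist()]
print(ranked)
```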
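For step 3, a hedged sketch of fine-tuning only the visual projection of CLIP so that image features activate the concepts present in each image's caption. The binary cross-entropy on caption-derived pseudo-labels and the 0.07 temperature are assumptions standing in for the paper's self-supervised alignment objective; only the choice to tune the last layer comes from the summary above.

```python
# Step 3 (sketch): align concept activations with caption-present concepts.
import torch
import torch.nn.functional as F
from transformers import CLIPModel

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")

# Freeze everything except the visual projection (the "last layer" here).
for p in model.parameters():
    p.requires_grad = False
for p in model.visual_projection.parameters():
    p.requires_grad = True

optimizer = torch.optim.AdamW(model.visual_projection.parameters(), lr=1e-5)

def alignment_step(pixel_values, concept_text_feats, concept_labels):
    """One self-supervised training step.
    pixel_values:       (B, 3, H, W) preprocessed images
    concept_text_feats: (C, D) frozen, L2-normalized concept embeddings
    concept_labels:     (B, C) 1.0 if the concept occurs in the image's caption
    """
    img_feats = F.normalize(model.get_image_features(pixel_values=pixel_values), dim=-1)
    activations = img_feats @ concept_text_feats.T          # (B, C) concept scores
    loss = F.binary_cross_entropy_with_logits(activations / 0.07,  # assumed temperature
                                              concept_labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```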
The authors then propose a suite of quantitative and human evaluation protocols to measure the interpretability, precision, thoroughness, and generalizability of the discovered concepts. Extensive experiments on six diverse visual recognition benchmarks demonstrate that the concepts discovered and learned by the CDL framework are indeed precise, thorough, and generalizable, providing strong evidence that pre-trained VLMs can effectively encode visual concepts.
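One way to operationalize the quantitative side of such an evaluation is sketched below: comparing each image's top-activated concepts against ground-truth attribute annotations. The overlap metrics are assumed proxies for precision and thoroughness, not the paper's exact protocol, which also includes human evaluation.

```python
# Evaluation (sketch): top-k precision/recall of concept activations
# against attribute annotations, as assumed proxies for the paper's
# precision and thoroughness measures.
import torch

def concept_precision_recall(activations, gt_attrs, k=5):
    """activations: (N, C) image-concept scores; gt_attrs: (N, C) 0/1 labels."""
    topk = activations.topk(k, dim=-1).indices                  # (N, k)
    hits = gt_attrs.gather(-1, topk)                            # (N, k)
    precision = hits.float().mean().item()                      # top-k precision
    recall = (hits.sum(-1) / gt_attrs.sum(-1).clamp(min=1)).float().mean().item()
    return precision, recall
```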
The authors also show that the concepts discovered by CDL outperform previous state-of-the-art methods in both full-shot and few-shot visual recognition tasks, highlighting the practical benefits of the learned visual concepts.
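As an illustration of how such concepts support recognition, a linear probe over concept activations acts as a concept-bottleneck classifier whose weights are directly interpretable. The shapes and random stand-in data below are assumptions; the paper's exact classification head may differ.

```python
# Usage (sketch): a concept-bottleneck linear probe for full/few-shot recognition.
import torch
import torch.nn as nn

num_concepts, num_classes = 512, 100

# activations: (N, num_concepts) image-concept scores from the tuned VLM.
activations = torch.randn(16, num_concepts)      # stand-in batch
labels = torch.randint(0, num_classes, (16,))

probe = nn.Linear(num_concepts, num_classes)
optimizer = torch.optim.Adam(probe.parameters(), lr=1e-3)
loss = nn.functional.cross_entropy(probe(activations), labels)
loss.backward()
optimizer.step()

# Each class weight vector is interpretable: its largest entries name the
# concepts (e.g. "brown", "spiky") that most support that class.
top = probe.weight[0].topk(5).indices  # concept indices explaining class 0
```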
Statistics
Pre-trained vision-language models can learn visual concepts that are precise, thorough, and generalizable.
The discovered concepts outperform previous state-of-the-art methods in both full-shot and few-shot visual recognition tasks.
The CDL framework can discover category-agnostic and visually discriminative concepts from a large image-caption dataset.
Quotes
"Do vision-language models (VLMs) pre-trained to caption an image of a durian learn visual concepts such as brown (color) and spiky (texture) at the same time?"
"Visual concepts such as color, shape, and texture help models generalize compositionally, and can be incorporated into neuro-symbolic frameworks or offer concept-based explanations for classification decisions."
"We observe that recent works prompting VLMs with concepts often differ in their strategies to define and evaluate the visual concepts, leading to conflicting conclusions."