Core Concepts
Vision-language models (VLMs) struggle with open-set recognition: their finite query sets impose closed-set assumptions, leading to confident misclassifications of unknown objects and reduced precision.
Abstract
The paper examines the limitations of vision-language models (VLMs) in open-set recognition. It shows that the finite query sets used by VLMs impose closed-set assumptions, leading to misclassifications and low precision. The paper introduces a revised definition of the open-set problem for VLMs, proposes a new benchmark for evaluation, and tests baseline approaches to open-set recognition. Experiments reveal that state-of-the-art VLM classifiers and object detectors perform poorly under open-set conditions. Negative embeddings are explored as a potential remedy, showing a trade-off between reducing open-set errors and maintaining closed-set accuracy. The impact of query-set size on performance is also analysed.
1. Introduction
Closed-set assumption ingrained in vision models.
Open-set conditions challenge model assumptions.
Importance of evaluating models for open-set recognition.
2. Background
Vision-language models revolutionize image classification.
Foundation models trained on internet-scale datasets.
VLMs adapt well to zero-shot classification tasks.
3. Problem Definition
Mapping images and text into joint embedding space.
Closed-set vs. open-set assumptions in VLMs.
Baseline approaches for open-set recognition with VLMs.
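The joint-embedding formulation and the thresholding baseline above can be sketched with pre-computed, L2-normalised embeddings. The vectors, class assignments, and threshold below are hypothetical placeholders, not the paper's data or exact method:

```python
import numpy as np

def normalize(v):
    # L2-normalise rows so dot products equal cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Hypothetical pre-computed text embeddings in the shared space,
# one row per query, e.g. "a photo of a cat", "a photo of a dog"
text_emb = normalize(np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
]))

def zero_shot_classify(image_emb, text_emb):
    # Closed-set rule: always return the best-matching query,
    # even when the image shows none of the queried classes.
    sims = image_emb @ text_emb.T
    return int(np.argmax(sims))

def open_set_classify(image_emb, text_emb, threshold=0.5):
    # Baseline open-set rule: reject (-1 = "unknown") when even
    # the best cosine similarity falls below a threshold.
    sims = image_emb @ text_emb.T
    best = int(np.argmax(sims))
    return best if sims[best] >= threshold else -1

cat_like = normalize(np.array([0.9, 0.1, 0.0]))
unknown  = normalize(np.array([0.1, 0.1, 0.9]))

print(zero_shot_classify(cat_like, text_emb))  # 0: matches the first query
print(zero_shot_classify(unknown, text_emb))   # still forced into the query set
print(open_set_classify(unknown, text_emb))    # -1: rejected as unknown
```

The closed-set function illustrates the core problem: with a finite query set and an argmax, every image receives one of the queried labels, no matter how poor the match.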
4. Evaluation Protocol
Creating an open-set recognition dataset for VLMs.
Metrics used to evaluate performance.
Testing different VLM classifiers and object detectors.
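One generic way to score open-set behaviour (a sketch of the usual style of metric, not necessarily the paper's exact protocol) is to report closed-set accuracy on known classes alongside an open-set error rate: the fraction of unknown-object images whose best similarity still clears the confidence threshold. The score values below are made up for illustration:

```python
import numpy as np

def closed_set_accuracy(preds, labels):
    # Standard accuracy over images of known classes
    return float(np.mean(preds == labels))

def open_set_error(top_scores_unknown, threshold):
    # Fraction of unknown-class images whose top similarity
    # exceeds the threshold, i.e. confidently mislabelled
    return float(np.mean(top_scores_unknown >= threshold))

# Hypothetical top-similarity scores for images of unknown classes
scores = np.array([0.2, 0.6, 0.4, 0.7])
print(open_set_error(scores, threshold=0.5))  # 0.5: two of four slip through
```

Raising the threshold lowers the open-set error but also rejects more known-class images, which is the same trade-off the paper observes with negative embeddings.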
5. Experiments and Results
State-of-the-art VLMs perform poorly in open-set conditions.
Impact of negative queries on performance.
Correlation between closed-set and open-set performance.
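The negative-query idea summarised above can be sketched as follows: extra text queries intended to absorb everything outside the classes of interest are appended to the query set, and an image whose best match is a negative query is rejected as unknown. All embeddings here are hypothetical placeholders, not the paper's prompts:

```python
import numpy as np

def normalize(v):
    # L2-normalise rows so dot products equal cosine similarities
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Positive queries (the classes of interest) and a negative query
# meant to soak up out-of-set images; embeddings are illustrative.
positive = normalize(np.array([[1.0, 0.0, 0.0],
                               [0.0, 1.0, 0.0]]))
negative = normalize(np.array([[0.0, 0.0, 1.0]]))

def classify_with_negatives(image_emb, positive, negative):
    # Concatenate queries; indices >= len(positive) are negatives,
    # so an argmax landing there means "unknown" (-1).
    queries = np.vstack([positive, negative])
    best = int(np.argmax(image_emb @ queries.T))
    return best if best < len(positive) else -1

dog_like = normalize(np.array([0.1, 0.9, 0.0]))
other    = normalize(np.array([0.1, 0.2, 0.9]))

print(classify_with_negatives(dog_like, positive, negative))  # 1
print(classify_with_negatives(other, positive, negative))     # -1
```

This mirrors the trade-off reported in the results: negatives cut open-set errors, but known-class images that happen to sit near a negative embedding get rejected too, eroding closed-set accuracy.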
Quotes
"We answer this question with a clear no – VLMs introduce closed-set assumptions via their finite query set."
"Open vocabulary object detection requires detectors that can generalize to an arbitrary set of object classes at test time."