Core Concepts
FreeDA is a training-free approach to open-vocabulary semantic segmentation that builds diffusion-augmented visual prototypes offline and, at test time, combines local and global similarities to segment input images.
Abstract
The paper proposes FreeDA, a training-free method for open-vocabulary semantic segmentation. The approach consists of two main stages:
Offline Prototype Generation:
A large set of textual-visual reference embeddings is generated in an offline stage.
Textual captions are used to condition a diffusion model (Stable Diffusion) to generate synthetic images and corresponding localization masks.
The generated images are then used to extract visual prototypes by pooling self-supervised visual features (DINOv2) on the localization masks.
Textual keys are also extracted by embedding the caption words in their textual context.
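The prototype extraction step above can be sketched as a masked pooling over a dense feature map. This is a minimal illustration, not the paper's implementation: it assumes average pooling followed by L2 normalization, and the function name `masked_average_prototype`, the array shapes, and the toy values are all hypothetical. In practice the features would come from a backbone such as DINOv2 and the mask from the diffusion model, resized to the same spatial resolution.

```python
import numpy as np

def masked_average_prototype(features: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Average the features that fall inside a binary localization mask.

    features: (H, W, D) dense feature map (assumed to come from a
              self-supervised backbone; random or toy values work too).
    mask:     (H, W) boolean or 0/1 mask at the same spatial resolution.
    Returns an L2-normalized (D,) prototype vector.
    """
    weights = mask.astype(np.float64)
    total = weights.sum()
    if total == 0:
        raise ValueError("mask selects no locations")
    # Weighted sum over the spatial dimensions, divided by the mask area.
    prototype = (features * weights[..., None]).sum(axis=(0, 1)) / total
    norm = np.linalg.norm(prototype)
    return prototype / norm if norm > 0 else prototype

# Toy example: a 2x2 feature map with 3-dimensional features.
features = np.array([[[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]],
                     [[1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]])
mask = np.array([[1, 0],
                 [1, 0]])  # selects the left column only
prototype = masked_average_prototype(features, mask)
```

Normalizing the prototype is a design choice made here so that later dot products behave like cosine similarities; the source does not state which pooling or normalization FreeDA uses.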
Inference:
At test time, the input textual categories are used to retrieve the most similar textual keys from the pre-built index.
The corresponding visual prototypes are then used to compute local similarities with class-agnostic superpixel regions extracted from the input image.
Global similarities are also computed using a multimodal encoder (CLIP) to capture the overall semantic correspondence between the image and the input categories.
The local and global similarities are combined to assign each superpixel to the most relevant semantic class.
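The inference steps above can be sketched end to end with plain arrays. This is a hedged illustration under several assumptions that the summary does not confirm: retrieval uses cosine similarity with a top-k cutoff, the k retrieved prototypes of a class are averaged into one class prototype, and local and global scores are mixed with a convex weight `alpha`. The function names, the parameters `k` and `alpha`, and the toy embeddings are hypothetical; real inputs would be text-encoder embeddings, superpixel-pooled visual features, and CLIP image-text similarities.

```python
import numpy as np

def l2_normalize(x: np.ndarray, axis: int = -1, eps: float = 1e-12) -> np.ndarray:
    """Scale vectors along `axis` to unit length."""
    return x / np.maximum(np.linalg.norm(x, axis=axis, keepdims=True), eps)

def retrieve_prototypes(class_embs, text_keys, prototypes, k):
    """For each class, fetch the k prototypes with the most similar textual keys.

    class_embs: (C, T) class-name embeddings.
    text_keys:  (N, T) textual keys stored in the index.
    prototypes: (N, D) visual prototypes aligned with text_keys.
    Returns an array of shape (C, k, D).
    """
    sims = l2_normalize(class_embs) @ l2_normalize(text_keys).T  # (C, N)
    top_idx = np.argsort(-sims, axis=1)[:, :k]                   # (C, k)
    return prototypes[top_idx]

def assign_superpixels(region_embs, class_protos, global_sims, alpha=0.5):
    """Label each superpixel with the class of highest combined similarity.

    region_embs:  (R, D) one pooled visual embedding per superpixel.
    class_protos: (C, k, D) retrieved prototypes per class.
    global_sims:  (C,) image-level similarity to each class.
    alpha:        assumed weight of the local term (1 - alpha for global).
    """
    class_proto = l2_normalize(class_protos.mean(axis=1))   # (C, D)
    local = l2_normalize(region_embs) @ class_proto.T       # (R, C)
    combined = alpha * local + (1.0 - alpha) * global_sims[None, :]
    return combined.argmax(axis=1)

# Toy index: four textual keys, each paired with a visual prototype.
text_keys = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
prototypes = np.array([[1.0, 0.0, 0.0], [1.0, 0.0, 0.0],
                       [0.0, 1.0, 0.0], [0.0, 1.0, 0.0]])
class_embs = np.array([[1.0, 0.0], [0.0, 1.0]])   # two query categories
region_embs = np.array([[0.9, 0.1, 0.0], [0.2, 0.8, 0.0]])  # two superpixels
global_sims = np.array([0.5, 0.5])                # uninformative global prior

class_protos = retrieve_prototypes(class_embs, text_keys, prototypes, k=2)
labels = assign_superpixels(region_embs, class_protos, global_sims)
```

With an uninformative global term, as in this toy setup, the labels are driven entirely by the local similarities; a skewed `global_sims` vector would shift borderline superpixels toward the classes the whole image matches best.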
The experiments show that FreeDA achieves state-of-the-art performance on five open-vocabulary segmentation benchmarks, surpassing previous methods by a large margin without requiring any training.
Stats
This summary does not reproduce specific numerical results from the paper. It covers the overall methodology and the qualitative outcome of the experiments.
Quotes
No direct quotes from the paper are singled out here as particularly striking or as supporting its key arguments.