TAG proposes a novel approach for open-vocabulary semantic segmentation without the need for training, annotation, or guidance.
Plug-and-Play Open-Vocabulary Semantic Segmentation (PnP-OVSS) is a simple yet highly effective training-free technique that extracts accurate open-vocabulary semantic segmentation from off-the-shelf vision-language models without any additional training or dense annotations.
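As a rough illustration of the training-free idea of reading dense labels out of a frozen vision-language model, the sketch below scores every image patch against a set of class prompts with an off-the-shelf CLIP model. The model choice, the projection of patch tokens, and the prompt templates are assumptions made for illustration; this is not the PnP-OVSS procedure itself.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

image = Image.open("example.jpg")  # hypothetical input image
prompts = ["a photo of a cat", "a photo of a dog", "a photo of grass"]

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    vision_out = model.vision_model(pixel_values=inputs["pixel_values"])
    patch_tokens = vision_out.last_hidden_state[:, 1:, :]        # drop the CLS token
    # Project patch tokens into the shared image-text space (an assumption:
    # reusing the CLS projection for dense tokens, as in MaskCLIP-style tricks).
    patch_embeds = model.visual_projection(
        model.vision_model.post_layernorm(patch_tokens))
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"])

patch_embeds = patch_embeds / patch_embeds.norm(dim=-1, keepdim=True)
text_embeds = text_embeds / text_embeds.norm(dim=-1, keepdim=True)
sim = patch_embeds @ text_embeds.T                                # (1, num_patches, num_classes)
side = int(sim.shape[1] ** 0.5)                                   # 14x14 grid for ViT-B/16 at 224 px
coarse_seg = sim.argmax(-1).reshape(side, side)                   # per-patch class map, no training
```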
The proposed Generalization Boosted Adapter (GBA) enhances the generalization and robustness of vision-language models for open-vocabulary segmentation tasks by diversifying feature styles and suppressing spurious correlations.
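Feature-style diversification of this kind can be sketched as a small residual adapter that jitters channel statistics during training while leaving the normalized content intact. The module below is a hypothetical illustration (names, dimensions, and the noise model are assumptions), not the GBA architecture.

```python
import torch
import torch.nn as nn

class StyleDiversificationAdapter(nn.Module):
    """Hypothetical adapter: perturbs per-channel feature statistics ("style")
    during training to diversify styles, then refines features through a
    lightweight bottleneck with a residual connection."""

    def __init__(self, dim: int, noise_std: float = 0.1):
        super().__init__()
        self.noise_std = noise_std
        self.down = nn.Linear(dim, dim // 4)
        self.up = nn.Linear(dim // 4, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:   # x: (batch, tokens, dim)
        mu = x.mean(dim=1, keepdim=True)
        sigma = x.std(dim=1, keepdim=True) + 1e-6
        content = (x - mu) / sigma                         # style-normalized content
        if self.training:                                  # randomly jitter style statistics
            mu = mu * (1 + torch.randn_like(mu) * self.noise_std)
            sigma = sigma * (1 + torch.randn_like(sigma) * self.noise_std)
        restyled = content * sigma + mu
        return x + self.up(torch.relu(self.down(restyled)))
```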
PixelCLIP is a novel method that adapts the CLIP image encoder for pixel-level understanding by guiding the model on where to look, achieved using unlabeled images and masks generated by vision foundation models.
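A minimal sketch of such a "where to look" signal, assuming class-agnostic masks (e.g., from SAM) have already been rasterized onto the CLIP patch grid: dense patch features are pooled inside each mask to yield one embedding per region, which could then be clustered or contrasted during adaptation. The function name and tensor shapes are hypothetical, not PixelCLIP's actual interface.

```python
import torch

def mask_pooled_features(patch_feats: torch.Tensor, masks: torch.Tensor) -> torch.Tensor:
    """
    patch_feats: (num_patches, dim) dense features from the CLIP image encoder.
    masks: (num_masks, num_patches) binary class-agnostic masks flattened to the patch grid.
    Returns one pooled embedding per mask, i.e. a region-level training signal.
    """
    masks = masks.float()
    weights = masks / masks.sum(dim=1, keepdim=True).clamp(min=1)  # average inside each mask
    return weights @ patch_feats                                   # (num_masks, dim)

# Usage sketch: region embeddings from unlabeled masks could be grouped into
# clusters that act as pseudo-classes when fine-tuning the image encoder.
region_embeds = mask_pooled_features(torch.randn(196, 512), torch.randint(0, 2, (8, 196)))
```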
OVDiff is a novel method that leverages generative text-to-image diffusion models to synthesize, on demand, efficient segmenters for arbitrary textual categories, enabling open-vocabulary semantic segmentation without further training or data collection.
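One plausible reading of synthesizing a segmenter on demand is to generate a few support images per category and distill them into prototypes that dense image features can be matched against. The sketch below illustrates that idea under stated assumptions: the Stable Diffusion checkpoint, the CLIP embedder, the prompt template, and the averaging into a single prototype are all illustrative choices, not necessarily those used by OVDiff.

```python
import torch
from diffusers import StableDiffusionPipeline
from transformers import CLIPModel, CLIPProcessor

pipe = StableDiffusionPipeline.from_pretrained("runwayml/stable-diffusion-v1-5")
clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch16")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch16")

def category_prototype(name: str, n_support: int = 4) -> torch.Tensor:
    """Synthesize support images for one category and average their CLIP image
    embeddings into a single prototype (a deliberate simplification)."""
    images = pipe([f"a photo of a {name}"] * n_support).images
    inputs = processor(images=images, return_tensors="pt")
    with torch.no_grad():
        feats = clip.get_image_features(**inputs)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return feats.mean(dim=0)

prototypes = torch.stack([category_prototype(c) for c in ["cat", "dog", "grass"]])
# At test time, dense features of a query image would be compared to these
# prototypes and each pixel assigned to the nearest one, with no extra training.
```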