A straightforward adaptation of CLIP that enforces localization of patches in the self-attention, significantly improving performance on open-vocabulary semantic segmentation without requiring additional data, auxiliary pre-trained networks, or extensive hyperparameter tuning.
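One way such a localization constraint on the self-attention can be pictured (a minimal sketch only; the projection names and the self-correlation formulation are assumptions, not the paper's actual mechanism) is to replace the usual query-key attention in CLIP's final vision-transformer block with query-query and key-key correlations, which peak on the diagonal and keep each patch attending to itself and to spatially similar patches:

```python
import torch
import torch.nn.functional as F

def localized_self_attention(x, w_q, w_k, w_v, scale):
    """Illustrative self-attention variant that keeps attention maps
    localized around each patch instead of collapsing onto a few
    global tokens. x: (num_patches, dim); w_q, w_k, w_v: (dim, dim)
    projection weights (placeholders for CLIP's learned projections)."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    # Query-query and key-key correlations are largest on the diagonal,
    # so each patch's attention stays concentrated on itself and on
    # patches with similar features.
    attn = F.softmax((q @ q.T + k @ k.T) * scale, dim=-1)
    return attn @ v

# Toy usage with random patch features (196 patches, 512-dim as in ViT-B/16).
x = torch.randn(196, 512)
w = [torch.randn(512, 512) * 0.02 for _ in range(3)]
out = localized_self_attention(x, *w, scale=512 ** -0.5)
```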
A novel cost-based approach to effectively adapt the CLIP vision-language model for open-vocabulary semantic segmentation, achieving state-of-the-art performance.
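A cost-based adaptation of this kind typically starts from a cosine-similarity cost volume between CLIP patch embeddings and the text embeddings of the candidate class names; the sketch below (illustrative only, with assumed tensor names and a naive per-patch readout in place of the paper's learned aggregation) shows how such a volume can be built:

```python
import torch
import torch.nn.functional as F

def build_cost_volume(patch_feats, text_feats):
    """patch_feats: (H*W, D) CLIP patch embeddings for one image.
    text_feats:  (C, D) CLIP text embeddings, one per candidate class.
    Returns a (H*W, C) cosine-similarity cost volume; a cost-based
    adapter would refine this volume into segmentation masks."""
    patch_feats = F.normalize(patch_feats, dim=-1)
    text_feats = F.normalize(text_feats, dim=-1)
    return patch_feats @ text_feats.T

# Toy usage with random features (D=512 as in CLIP ViT-B/16, 20 classes).
cost = build_cost_volume(torch.randn(196, 512), torch.randn(20, 512))
labels = cost.argmax(dim=-1)  # naive per-patch class prediction from the raw cost
```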
GOV-NeSF, a novel approach that offers a generalizable implicit representation of 3D scenes with open-vocabulary semantics, achieving state-of-the-art performance in both 2D and 3D open-vocabulary semantic segmentation without requiring ground-truth semantic labels or depth priors, and generalizing effectively across scenes and datasets without fine-tuning.
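To make the idea of an implicit 3D representation with open-vocabulary semantics concrete, the toy field below maps query points to a density and a language-aligned feature that is classified against text embeddings of the class names; the architecture, dimensions, and names are illustrative placeholders, not GOV-NeSF's actual design:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class OpenVocabField(nn.Module):
    """Toy implicit field: 3D point -> (density, language-aligned feature)."""
    def __init__(self, feat_dim=512, hidden=256):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(3, hidden), nn.ReLU(),
                                 nn.Linear(hidden, hidden), nn.ReLU())
        self.density_head = nn.Linear(hidden, 1)
        self.feature_head = nn.Linear(hidden, feat_dim)

    def forward(self, xyz):                     # xyz: (N, 3) query points
        h = self.mlp(xyz)
        return self.density_head(h), self.feature_head(h)

# Classify queried points against CLIP-style text embeddings (stand-ins here).
field = OpenVocabField()
density, feats = field(torch.rand(1024, 3))
text_feats = F.normalize(torch.randn(20, 512), dim=-1)
logits = F.normalize(feats, dim=-1) @ text_feats.T   # (1024, 20) per-point class scores
```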
A training-free framework for open-vocabulary semantic segmentation that constructs well-aligned intra-modal reference features and conducts relation-aware matching to achieve robust region classification.
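Relation-aware matching of this kind can be sketched (purely as an illustration; the names and steps are assumptions, not the framework's actual procedure) by classifying a region through the correlation of its similarity profile over a bank of intra-modal reference features, rather than through a single direct region-to-text similarity:

```python
import torch
import torch.nn.functional as F

def relation_aware_match(region_feat, reference_feats, class_ref_profiles):
    """region_feat:        (D,) visual feature of one candidate region.
    reference_feats:    (R, D) intra-modal (visual) reference features.
    class_ref_profiles: (C, R) each class's expected similarity profile
                        over the same references.
    Returns (C,) class scores based on how the region *relates* to the
    references instead of a direct region-to-text similarity."""
    region_profile = F.normalize(reference_feats, dim=-1) @ F.normalize(region_feat, dim=-1)
    region_profile = F.normalize(region_profile, dim=-1)          # (R,)
    class_ref_profiles = F.normalize(class_ref_profiles, dim=-1)  # (C, R)
    return class_ref_profiles @ region_profile                    # (C,)

# Toy usage: one region, 64 reference features, 20 candidate classes.
scores = relation_aware_match(torch.randn(512), torch.randn(64, 512), torch.randn(20, 64))
```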