Core Concept
A training-free framework for open-vocabulary semantic segmentation that constructs well-aligned intra-modal reference features and conducts relation-aware matching to achieve robust region classification.
Abstract
The paper presents Relation-aware Intra-modal Matching (RIM), a training-free framework for open-vocabulary semantic segmentation (OVS). Its key ideas are:
- Intra-modal Reference Construction:
  - The authors leverage the Stable Diffusion (SD) model and Segment Anything Model (SAM) to generate category-specific reference images and corresponding foreground masks.
  - The reference features are then extracted in the all-purpose feature space of DINOv2, which exhibits better alignment compared to cross-modal features.
- Relation-aware Matching:
  - The authors propose a relation-aware matching strategy based on ranking distribution, which captures the structure information implicit in inter-class relationships.
  - This enables more robust region classification compared to individual region-reference comparison.
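The two ideas above can be illustrated with a minimal sketch. It is not the authors' implementation: the 2-D vectors are toy stand-ins for DINOv2 patch features, the binary masks stand in for SAM foreground masks, and the function names (`masked_average`, `relation_profile`, `classify_region`) are hypothetical. The use of a softmax over similarities (a "top-one" ranking distribution) and of KL divergence to compare distributions is an assumption made for illustration; the summary only states that matching is based on a ranking distribution.

```python
import math

def masked_average(features, mask):
    """Average the feature vectors at positions where the mask marks foreground."""
    selected = [f for f, m in zip(features, mask) if m]
    if not selected:
        raise ValueError("mask selects no foreground positions")
    dim = len(selected[0])
    return [sum(f[d] for f in selected) / len(selected) for d in range(dim)]

def cosine(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

def softmax(scores, temperature=0.1):
    """Numerically stable softmax; the temperature value is an assumption."""
    top = max(scores)
    exps = [math.exp((s - top) / temperature) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def relation_profile(feature, references, names):
    """Distribution over categories induced by similarity to every reference."""
    return softmax([cosine(feature, references[n]) for n in names])

def classify_region(region_feature, references, eps=1e-12):
    """Pick the category whose relation profile is closest to the region's."""
    names = sorted(references)
    region_dist = relation_profile(region_feature, references, names)
    best_name, best_div = None, float("inf")
    for name in names:
        ref_dist = relation_profile(references[name], references, names)
        # KL(reference || region): small when both relate to all categories alike.
        div = sum(p * math.log((p + eps) / (q + eps))
                  for p, q in zip(ref_dist, region_dist))
        if div < best_div:
            best_name, best_div = name, div
    return best_name

# Toy reference construction: patch features plus a foreground mask per category.
references = {
    "cat": masked_average([[1.0, 0.1], [0.9, 0.0], [0.0, 1.0]], [1, 1, 0]),
    "dog": masked_average([[0.1, 1.0], [0.0, 0.8], [1.0, 0.0]], [1, 1, 0]),
}
print(classify_region([0.8, 0.2], references))
```

The point of the comparison is that a region is matched on how it relates to all categories at once, rather than on a single region-to-reference similarity score.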
The authors conduct extensive experiments on three benchmark datasets and show that RIM outperforms previous state-of-the-art methods by large margins, including an mIoU improvement of over 10% on the PASCAL VOC dataset.
Statistics
The authors report the following key metrics:
- On the PASCAL VOC dataset, RIM achieves 77.8% mIoU, outperforming the previous state of the art by over 10%.
- On the PASCAL Context dataset, RIM achieves 34.3% mIoU, surpassing the previous best by 8%.
- On the COCO Object dataset, RIM achieves 44.5% mIoU, improving over the previous state of the art by 6.6%.
Quotes
"We attribute this to the natural gap between the highly abstract and monotonous category textual features and the visual features that are more concrete and diverse."
"The ranking permutation reflects the relevance of the corresponding categories w.r.t. the region feature. An agent-ranking probability distribution can be constructed by associating the probability with every rank permutation for both the region feature and all category reference features."
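One standard way to associate a probability with every rank permutation is the Plackett-Luce model used in listwise learning to rank. Whether RIM uses exactly this form is an assumption; the symbols below are introduced here for illustration only. Let s_i be the similarity between a feature and the reference of category i, and let π be a permutation of the n categories:

```latex
P_s(\pi) = \prod_{j=1}^{n} \frac{\exp\!\left(s_{\pi(j)}\right)}{\sum_{k=j}^{n} \exp\!\left(s_{\pi(k)}\right)},
\qquad
P_s(i \text{ ranked first}) = \frac{\exp(s_i)}{\sum_{k=1}^{n} \exp(s_k)}
```

The second expression is the top-one marginal of the first. It reduces the n! permutation probabilities to n values, which makes comparing the distribution of a region feature with that of each category reference tractable.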