Core Concepts
The paper proposes MGCA, a Multi-Grained Cross-modal Alignment framework that bridges the alignment granularity gap and enables efficient open-vocabulary semantic segmentation learned purely from text supervision.
Abstract
The paper introduces MGCA, a framework that addresses the alignment granularity gap in learning semantic segmentation from text supervision. It constructs multi-granular pseudo semantic correspondences, develops an adaptive semantic unit, and achieves state-of-the-art performance on various benchmarks without dense annotations.
The paper examines the challenges of learning open-vocabulary semantic segmentation from image-text pairs and presents a detailed approach to address them. It argues that fine-grained alignment at the object, region, and pixel levels is essential for accurate dense predictions, and shows that the proposed adaptive semantic unit substantially improves segmentation quality.
Key points are the MGCA framework, the role of multi-granular alignment, and the adaptive semantic unit. The paper reports strong results on multiple datasets without relying on dense annotations; a conceptual sketch of the multi-granular alignment idea follows.
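Below is a minimal sketch of what combining object-, region-, and pixel-level alignment objectives can look like. This is not the authors' MGCA implementation: the function names (info_nce, multi_granular_alignment), the assumption that pseudo-region masks are supplied as inputs, the pooling choices, the temperatures, and the loss weights are all illustrative placeholders.

```python
# Conceptual sketch of multi-granular cross-modal alignment (not the MGCA code).
import torch
import torch.nn.functional as F


def info_nce(a, b, temperature=0.07):
    """Symmetric InfoNCE between two batches of paired embeddings."""
    a = F.normalize(a, dim=-1)
    b = F.normalize(b, dim=-1)
    logits = a @ b.t() / temperature                      # (N, N) similarity matrix
    targets = torch.arange(a.size(0), device=a.device)    # matched pairs on the diagonal
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))


def multi_granular_alignment(pixel_feats, text_emb, region_masks,
                             w_obj=1.0, w_reg=1.0, w_pix=1.0):
    """
    pixel_feats:  (B, HW, D) dense visual features
    text_emb:     (B, D)     caption embeddings, one per image
    region_masks: (B, R, HW) soft assignment of pixels to R pseudo regions
    Returns a weighted sum of object-, region-, and pixel-level losses.
    """
    # Object level: pool all pixels into one image-level embedding and align it
    # with the paired caption.
    obj_emb = pixel_feats.mean(dim=1)                                    # (B, D)
    loss_obj = info_nce(obj_emb, text_emb)

    # Region level: mask-weighted pooling into pseudo-region embeddings, then
    # (crudely) align each region with its image's caption embedding.
    region_emb = torch.einsum('brn,bnd->brd', region_masks, pixel_feats)
    region_emb = region_emb / region_masks.sum(-1, keepdim=True).clamp(min=1e-6)
    text_rep = text_emb.unsqueeze(1).expand(-1, region_masks.size(1), -1)
    loss_reg = info_nce(region_emb.flatten(0, 1), text_rep.flatten(0, 1))

    # Pixel level: push each pixel closer to its own caption than to the other
    # captions in the batch (a stand-in for dense pseudo correspondences).
    pix_scores = torch.einsum('bnd,cd->bnc',
                              F.normalize(pixel_feats, dim=-1),
                              F.normalize(text_emb, dim=-1))             # (B, HW, B)
    pix_targets = torch.arange(text_emb.size(0), device=text_emb.device)
    pix_targets = pix_targets.view(-1, 1).expand(-1, pixel_feats.size(1))  # (B, HW)
    loss_pix = F.cross_entropy(pix_scores.permute(0, 2, 1) / 0.07, pix_targets)

    return w_obj * loss_obj + w_reg * loss_reg + w_pix * loss_pix


if __name__ == "__main__":
    B, HW, D, R = 4, 196, 512, 8
    loss = multi_granular_alignment(torch.randn(B, HW, D),
                                    torch.randn(B, D),
                                    torch.rand(B, R, HW))
    print(loss.item())
```

In the paper itself, the multi-granular pseudo semantic correspondences are constructed from the image-text pairs rather than assumed as inputs, and the adaptive semantic unit governs how dense features are grouped for prediction; the sketch only illustrates how alignment objectives at the three granularities can be combined into a single training loss.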
Stats
8.7 mIoU on the ADE20K dataset with 29 million image-text pairs.
Training solely on the CC3M dataset with 4.72M learnable parameters.
New state-of-the-art zero-shot performance on 8 segmentation benchmarks.
Quotes
"Our method achieves substantial advancements over preceding state-of-the-art methods while utilizing a reduced amount of training data."
"Training solely on CC3M datasets with mere 4.72M learnable parameters, we achieve new state-of-the-art zero-shot performance."