Core Concepts
The proposed Image-Text Co-Decomposition (CoDe) framework jointly decomposes image-text pairs into corresponding regions and word segments, enabling direct region-word alignment and alleviating the discrepancy between training and testing for text-supervised semantic segmentation.
Abstract
The paper addresses the task of text-supervised semantic segmentation, which aims to segment arbitrary visual concepts within images using only image-text pairs without dense annotations.
The key insights are:
Existing methods have demonstrated that contrastive learning on image-text pairs can effectively align visual segments with the meanings of texts. However, there is a discrepancy between text alignment and semantic segmentation, as a text often consists of multiple semantic concepts, whereas semantic segmentation strives to create semantically homogeneous segments.
To address this issue, the authors propose a novel framework called Image-Text Co-Decomposition (CoDe), where the paired image and text are jointly decomposed into a set of image regions and a set of word segments, respectively, and contrastive learning is developed to enforce region-word alignment.
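The region-word contrastive objective can be illustrated with a minimal sketch. The code below is an assumed InfoNCE-style formulation over matched region/word-segment embedding pairs, not the paper's exact loss; the pairing convention (row i of each matrix is a matched pair), the temperature value, and all function names are illustrative assumptions.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Normalize rows to unit length for cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def region_word_contrastive_loss(regions, words, temperature=0.07):
    """Symmetric InfoNCE-style loss over region/word-segment pairs.

    regions: (N, D) image-region embeddings.
    words:   (N, D) word-segment embeddings; row i is assumed to be
             the segment matched with region i (an illustrative
             simplification of CoDe's alignment, not its exact loss).
    """
    r = l2_normalize(regions)
    w = l2_normalize(words)
    logits = r @ w.T / temperature  # (N, N) cosine similarities

    def xent(lg):
        # Softmax cross-entropy with the diagonal (matched pairs)
        # as the positive class for each row.
        lg = lg - lg.max(axis=1, keepdims=True)
        logp = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -np.mean(np.diag(logp))

    # Average the region-to-word and word-to-region directions.
    return 0.5 * (xent(logits) + xent(logits.T))
```

Correctly matched pairs yield a lower loss than shuffled pairings, which is the signal that drives region-word alignment during training.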
The authors also present a prompt learning mechanism that derives an extra representation to highlight an image segment or a word segment of interest, which helps extract more effective features from that segment.
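The highlighting idea can be sketched as follows. This is a simplified stand-in for the paper's prompt learning mechanism, assuming the extra representation is a learnable vector added to the token features of the segment of interest; the function name, shapes, and additive design are all hypothetical.

```python
import numpy as np

def highlight_segment(tokens, segment_mask, prompt_vec):
    """Emphasize a segment of interest with a learnable prompt vector.

    tokens:       (T, D) token/patch features from an image or text encoder.
    segment_mask: (T,) boolean mask selecting the segment of interest.
    prompt_vec:   (D,) learnable embedding (trained with the rest of the
                  model; shown here as a fixed array for illustration).

    Simplified sketch of prompt-based highlighting, not CoDe's exact design.
    """
    out = tokens.copy()
    out[segment_mask] = out[segment_mask] + prompt_vec
    return out
```

The encoder downstream then sees the highlighted tokens shifted toward the prompt direction, which lets it extract features conditioned on the chosen segment.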
Comprehensive experimental results demonstrate that the proposed CoDe framework performs favorably against existing text-supervised semantic segmentation methods on six benchmark datasets.
Stats
No specific numerical results are reproduced in this summary; the paper's evaluation is a comparative study against existing text-supervised segmentation methods across six benchmark datasets.