Key Concepts
By progressively incorporating target-related textual cues from the input description, the proposed Progressive Comprehension Network (PCNet) enhances visual-linguistic alignment for accurate weakly-supervised referring image segmentation.
Abstract
This paper explores the weakly-supervised referring image segmentation (WRIS) problem, where the target object is localized directly from image-text pairs without pixel-level ground-truth masks.
The key insights are:
- Humans often follow a step-by-step comprehension process to identify the target object, progressively utilizing target-related attributes and relations as cues.
- Existing WRIS methods encode the entire text description as a single embedding, overlooking critical target-related cues.
To address these issues, the authors propose the Progressive Comprehension Network (PCNet):
- It first uses a Large Language Model (LLM) to decompose the input text description into short phrases as target-related cues.
- These cues are then fed into a novel Conditional Referring Module (CRM) in multiple stages to update the referring text embedding and enhance the response map for target localization.
- A Region-aware Shrinking (RaS) loss is proposed to constrain the visual localization to be conducted progressively in a coarse-to-fine manner across different stages.
- An Instance-aware Disambiguation (IaD) loss is introduced to suppress instance localization ambiguity by differentiating overlapping response maps generated by different referring texts on the same image.
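The multi-stage idea in the list above can be illustrated with a toy sketch. This is not the authors' implementation: the embeddings are hand-made 3-dimensional vectors, the response map is plain cosine similarity between a text embedding and per-patch image features, and `update_embedding` is a hypothetical stand-in for the CRM that simply blends each cue into the referring embedding. The function names and the `weight` parameter are assumptions made for illustration only.

```python
import math


def normalize(v):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v] if n > 0 else list(v)


def response_map(text_emb, patch_feats):
    """Cosine similarity between the text embedding and every image patch."""
    t = normalize(text_emb)
    return [sum(a * b for a, b in zip(t, normalize(p))) for p in patch_feats]


def update_embedding(text_emb, cue_emb, weight=0.5):
    """Hypothetical stand-in for the CRM: blend one cue into the embedding."""
    blended = [(1 - weight) * t + weight * c for t, c in zip(text_emb, cue_emb)]
    return normalize(blended)


def progressive_localization(text_emb, cue_embs, patch_feats):
    """Return one response map per stage (initial map plus one per cue)."""
    emb = normalize(text_emb)
    maps = [response_map(emb, patch_feats)]
    for cue in cue_embs:
        emb = update_embedding(emb, cue)
        maps.append(response_map(emb, patch_feats))
    return maps


# Toy feature axes (assumed): [is a player, blue/gray uniform, catching a ball]
patches = [
    [1.0, 1.0, 1.0],  # patch 0: the referred player
    [1.0, 0.0, 0.0],  # patch 1: a different player
    [0.0, 0.0, 0.0],  # patch 2: background
]
sentence_emb = [1.0, 0.2, 0.2]  # single embedding dominated by "a player"
cues = [
    [0.0, 1.0, 0.0],  # "blue and gray uniform"
    [0.0, 0.0, 1.0],  # "catches a ball"
]

stage_maps = progressive_localization(sentence_emb, cues, patches)
for stage, scores in enumerate(stage_maps):
    best = max(range(len(scores)), key=scores.__getitem__)
    print(f"stage {stage}: best patch = {best}")
```

In this toy setup the single sentence-level embedding initially favors the wrong player, and the response on the referred player rises as each cue is folded in. The actual CRM, RaS loss, and IaD loss are learned components whose details are not given in this summary.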
Extensive experiments on three benchmarks show that PCNet outperforms state-of-the-art WRIS methods by a significant margin.
Statistics
Example input description: "a player wearing a blue and gray uniform catches a ball"
Decomposed target-related cues:
"a player"
"blue and gray uniform"
"catches a ball"
Quotes
"Inspired by the human comprehension process, we propose in this paper a novel Progressive Comprehension Network (PCNet) for WRIS."
"We first employ a Large Language Model (LLM) to dissect the input text description into multiple short phrases. These decomposed phrases are considered as target-related cues and fed into a novel Conditional Referring Module (CRM), which helps update the global referring embedding and enhance target localization in a multi-stage manner."
"We also propose a novel Region-aware Shrinking (RaS) loss to facilitate visual localization across different stages at the region level."
"Finally, we introduce an Instance-aware Disambiguation (IaD) loss to reduce the overlapping of the response maps by rectifying the alignment score of different referring texts to the same object."