The paper presents Point PrompTing (PPT), a framework for weakly supervised referring image segmentation (RIS). At its core is a point generator that harnesses CLIP's text-image alignment capability and SAM's powerful mask generation ability, while also producing negative point prompts to counter noisy prompts and an excessive focus on object parts.
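The paper does not publish its point-generator code here, but the idea of turning a text-image similarity map into positive and negative point prompts can be sketched as follows. This is a minimal NumPy illustration, assuming a precomputed CLIP-style similarity heatmap; the function name and the peak/trough selection rule are simplifications for illustration, not the authors' exact method. The label convention (1 = foreground, 0 = background) matches SAM's point-prompt interface.

```python
import numpy as np

def select_point_prompts(similarity, num_pos=1, num_neg=1):
    """Pick positive point prompts at similarity peaks and negative
    point prompts at low-similarity locations (a simplified stand-in
    for the paper's learned point generator)."""
    h, w = similarity.shape
    flat = similarity.ravel()
    pos_idx = np.argsort(flat)[-num_pos:]   # highest-similarity pixels
    neg_idx = np.argsort(flat)[:num_neg]    # lowest-similarity pixels
    idx = np.concatenate([pos_idx, neg_idx])
    # Convert flat indices to (x, y) coordinates, as SAM expects.
    coords = np.stack([idx % w, idx // w], axis=1)
    labels = np.array([1] * num_pos + [0] * num_neg)  # 1 = fg, 0 = bg
    return coords, labels
```

The resulting `coords`/`labels` pair is exactly the shape that a SAM-style predictor accepts as `point_coords` and `point_labels`, which is how positive and negative prompts would steer the mask toward the referred object and away from distractors.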
To address these challenges, the authors introduce a curriculum learning strategy that progressively transitions from simple class-based segmentation to complex referring image segmentation involving factors such as location and relationships. They also leverage object-centric images from ImageNet to help the point generator learn semantic-aware and comprehensive point prompts, rather than merely salient ones.
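The curriculum idea above can be sketched as a sampling schedule that starts from class-name expressions and gradually shifts toward full referring expressions. This is a hypothetical sketch, assuming a simple linear ramp; the paper's actual schedule and pools are not specified here.

```python
import random

def curriculum_sample(step, total_steps, simple_pool, complex_pool, rng=random):
    """Curriculum sampling: early in training, draw class-based (simple)
    expressions; later, draw full referring expressions with locations
    and relationships. The linear ramp is an assumption for illustration."""
    p_complex = min(1.0, step / total_steps)  # probability of a complex sample
    pool = complex_pool if rng.random() < p_complex else simple_pool
    return rng.choice(pool)
```

At step 0 this always draws from the simple pool, and by the final step it always draws from the complex pool, mirroring the progressive transition the authors describe.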
Experiments demonstrate that PPT significantly and consistently outperforms prior weakly supervised RIS techniques, achieving average mIoU improvements of 11.34%, 14.14%, and 6.97% on the RefCOCO, RefCOCO+, and G-Ref datasets, respectively. PPT also achieves markedly higher precision at various IoU thresholds than other weakly supervised methods.
Source: Qiyuan Dai, S..., arXiv, 04-19-2024. https://arxiv.org/pdf/2404.11998.pdf