PPT is a novel point prompting framework that uses curriculum learning and object-centric images to integrate frozen CLIP and SAM models effectively for weakly-supervised referring image segmentation.
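To make the idea of a point prompt concrete, here is a minimal sketch under stated assumptions; it is not the PPT method itself. It assumes a text-to-image similarity map (for example, one derived from CLIP features) is already available as a 2-D grid of scores, and it simply takes the highest-scoring pixel as an (x, y) point that could be handed to a promptable segmenter such as SAM. The function name `point_prompt_from_heatmap` and the toy heatmap are hypothetical, introduced only for illustration.

```python
from typing import List, Tuple


def point_prompt_from_heatmap(heatmap: List[List[float]]) -> Tuple[int, int]:
    """Return the (x, y) pixel coordinate of the highest-scoring location.

    Hypothetical helper: `heatmap` is assumed to be an H x W grid of
    text-to-image similarity scores. Ties resolve to the first maximum
    in row-major order.
    """
    if not heatmap or not heatmap[0]:
        raise ValueError("heatmap must be a non-empty H x W grid")

    best_row, best_col = 0, 0
    for r, row in enumerate(heatmap):
        for c, score in enumerate(row):
            if score > heatmap[best_row][best_col]:
                best_row, best_col = r, c

    # Point prompts are conventionally given as (x, y), i.e. (column, row).
    return best_col, best_row


# Toy 4 x 5 similarity map with a single peak at row 2, column 3.
toy_heatmap = [[0.0] * 5 for _ in range(4)]
toy_heatmap[2][3] = 1.0

point = point_prompt_from_heatmap(toy_heatmap)
```

Taking a single argmax is the simplest possible selection rule and is sensitive to noisy similarity maps; how PPT actually chooses and trains its point prompts is not specified in the sentence above.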
By progressively incorporating target-related textual cues from the input description, the proposed Progressive Comprehension Network (PCNet) enhances visual-linguistic alignment for accurate weakly-supervised referring image segmentation.