The paper introduces CRFormer, a referring image segmentation model that iteratively calibrates multi-modal features in the transformer decoder, addressing the challenge of efficiently propagating fine-grained semantic information from textual features to visual features.
The key highlights are:
The authors generate multiple language queries that capture different emphases and fine-grained semantics, mitigating the distortion that language information naturally undergoes as it propagates through the decoder.
They design a novel Calibration Decoder (CDec) that continuously calibrates the language information by generating new language queries at each decoder layer.
They introduce a Language Reconstruction Module and a reconstruction loss that measure how much the language information has been distorted after repeated calibration, further preventing it from being lost or corrupted.
The experiments show that the proposed CRFormer achieves new state-of-the-art results on three referring image segmentation datasets, demonstrating the effectiveness of the deeply integrated language approach.
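The iterative calibration idea described above can be illustrated with a minimal sketch: each decoder layer fuses visual features with the current language query, then regenerates the query from the original sentence embedding so it does not drift, and a reconstruction loss measures any remaining drift. All function names, the blending weights, and the vector representation here are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of CRFormer-style per-layer language-query calibration
# and a reconstruction loss; shapes and weights are illustrative only.

def mix(a, b, w):
    """Blend two equal-length vectors: w*a + (1 - w)*b."""
    return [w * x + (1 - w) * y for x, y in zip(a, b)]

def calibration_decoder(visual, language, num_layers=3, w=0.7):
    """Each layer fuses visual features with the current language query,
    then re-anchors the query on the original sentence embedding to
    limit drift (the 'calibration' step)."""
    query = language
    for _ in range(num_layers):
        visual = mix(visual, query, w)     # cross-modal fusion (sketch)
        query = mix(language, query, 0.5)  # regenerate query from the text
    return visual, query

def reconstruction_loss(query, language):
    """Mean squared error between the calibrated query and the original
    language embedding, measuring how distorted the text features are."""
    return sum((q - l) ** 2 for q, l in zip(query, language)) / len(query)
```

For example, `calibration_decoder([1.0, 0.0], [0.0, 1.0])` pulls the visual features toward the language query over three layers, while `reconstruction_loss` on the returned query stays at zero because the query is re-anchored on the original embedding each layer.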
Source: Yichen Yan, X... at arxiv.org, 04-15-2024
https://arxiv.org/pdf/2404.08281.pdf