Core Concept
This paper introduces the Rotated Multi-Scale Interaction Network (RMSIN), an approach to Referring Remote Sensing Image Segmentation (RRSIS). RRSIS is the task of segmenting the region of an aerial image that a textual description refers to.
Summary
The paper presents a new large-scale benchmark dataset, RRSIS-D, designed for the RRSIS task. RRSIS-D comprises 17,402 remote sensing image-caption-mask triplets, offering a significant improvement in scale and diversity compared to previous datasets.
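One RRSIS-D sample can be modeled as a small record to make the triplet structure concrete. The class name, the field names and the array layout (an H x W x 3 image and an H x W mask) are hypothetical and do not reflect the dataset's actual file format.

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class RRSISample:
    """One image-caption-mask triplet (hypothetical field names)."""
    image: np.ndarray  # H x W x 3 remote sensing image
    caption: str       # referring expression describing the target region
    mask: np.ndarray   # H x W mask of the referred region

    def __post_init__(self) -> None:
        if self.image.ndim != 3 or self.image.shape[2] != 3:
            raise ValueError("image must have shape H x W x 3")
        if self.mask.shape != self.image.shape[:2]:
            raise ValueError("mask must match the image's H x W")
        if not self.caption.strip():
            raise ValueError("caption must be non-empty")
        self.mask = self.mask.astype(bool)  # store the mask as booleans

    def foreground_fraction(self) -> float:
        """Share of pixels that belong to the referred region."""
        return float(self.mask.mean())
```

The shape checks in `__post_init__` catch a mask that was resized separately from its image. In this task, that kind of mismatch would otherwise surface only as a silently wrong loss.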
The authors propose the Rotated Multi-Scale Interaction Network (RMSIN) to tackle challenges specific to RRSIS. In aerial imagery, objects vary widely in spatial scale and orientation. RMSIN incorporates two modules for the scale problem:
- an Intra-scale Interaction Module (IIM), which extracts fine-grained details at each of multiple scales;
- a Cross-scale Interaction Module (CIM), which integrates these details coherently across the network.
In addition, the RMSIN decoder uses Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects. The authors present ARC as a novel contribution that significantly improves segmentation accuracy.
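The rotation part of ARC can be illustrated with a minimal NumPy sketch. It assumes the general adaptive-rotated-convolution recipe: each of several kernels is rotated by an input-dependent angle, and the rotated kernels are then mixed with input-dependent weights. The function names `rotate_kernel` and `combine_rotated_kernels` are hypothetical. The routing network that would predict the angles and weights is omitted, and RMSIN's actual decoder may differ in detail.

```python
import numpy as np


def rotate_kernel(kernel: np.ndarray, theta: float) -> np.ndarray:
    """Rotate a square k x k kernel by `theta` radians about its center.

    Each output tap samples the input kernel at the inversely rotated
    position using bilinear interpolation. Samples falling outside the
    kernel contribute zero. With row 0 displayed at the top, a positive
    theta rotates the kernel clockwise.
    """
    k = kernel.shape[0]
    if kernel.shape != (k, k):
        raise ValueError("kernel must be square")
    c = (k - 1) / 2.0
    rows, cols = np.meshgrid(np.arange(k), np.arange(k), indexing="ij")
    x, y = cols - c, rows - c  # offsets from the kernel center
    cos_t, sin_t = np.cos(theta), np.sin(theta)
    # Source position (in the unrotated kernel) for every output tap.
    src_x = cos_t * x + sin_t * y + c
    src_y = -sin_t * x + cos_t * y + c
    x0 = np.floor(src_x).astype(int)
    y0 = np.floor(src_y).astype(int)
    out = np.zeros((k, k), dtype=float)
    # Accumulate the four bilinear neighbours of each source position.
    for dy in (0, 1):
        for dx in (0, 1):
            xi, yi = x0 + dx, y0 + dy
            w = (1 - np.abs(src_x - xi)) * (1 - np.abs(src_y - yi))
            valid = (xi >= 0) & (xi < k) & (yi >= 0) & (yi < k)
            out[valid] += w[valid] * kernel[yi[valid], xi[valid]]
    return out


def combine_rotated_kernels(kernels, thetas, weights) -> np.ndarray:
    """Mix n kernels, each rotated by its own angle, with scalar weights.

    In an adaptive rotated convolution, `thetas` and `weights` would be
    predicted from the input feature map by a small routing network
    (omitted here). This sketch takes them as given.
    """
    if not (len(kernels) == len(thetas) == len(weights)):
        raise ValueError("kernels, thetas and weights must have equal length")
    return sum(w * rotate_kernel(kern, t)
               for kern, t, w in zip(kernels, thetas, weights))
```

For example, `combine_rotated_kernels([K], [np.pi / 2], [1.0])` turns `K` a quarter turn clockwise, so a kernel tuned to horizontal edges would respond to vertical ones. Rotating the kernel rather than the feature map keeps the output aligned with the image grid, which segmentation requires.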
Experiments on the RRSIS-D dataset show that RMSIN outperforms existing state-of-the-art models by a significant margin. The authors also conduct extensive ablation studies that validate the contribution of each proposed component: IIM, CIM, and ARC.
Statistics
"The visual backbone utilizes Swin Transformer [31], pre-trained on ImageNet22K [4], while the language backbone employs the base BERT model from HuggingFace's library [49]."
"The model is trained for 40 epochs using AdamW [32] with a weight decay of 0.01 and a starting learning rate of 3e-5, reducing according to polynomial decay."
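The polynomial learning-rate decay can be sketched as a plain function of the training step. The quoted setup does not state the decay power or a final learning rate, so the values used here (`power=0.9`, `min_lr=0.0`) are assumptions. `poly_lr` and `STEPS_PER_EPOCH` are hypothetical names.

```python
def poly_lr(step: int, total_steps: int, base_lr: float = 3e-5,
            power: float = 0.9, min_lr: float = 0.0) -> float:
    """Learning rate at `step` under polynomial decay from base_lr to min_lr."""
    if total_steps <= 0 or not 0 <= step <= total_steps:
        raise ValueError("need 0 <= step <= total_steps and total_steps > 0")
    remaining = 1.0 - step / total_steps  # fraction of training left
    return (base_lr - min_lr) * remaining ** power + min_lr


EPOCHS = 40
STEPS_PER_EPOCH = 1000  # hypothetical; depends on dataset size and batch size
TOTAL_STEPS = EPOCHS * STEPS_PER_EPOCH

if __name__ == "__main__":
    for epoch in (0, 10, 20, 30, 40):
        step = epoch * STEPS_PER_EPOCH
        print(f"epoch {epoch:2d}: lr = {poly_lr(step, TOTAL_STEPS):.2e}")
```

In PyTorch, the multiplicative factor `poly_lr(step, TOTAL_STEPS) / 3e-5` could be passed as the `lr_lambda` of `torch.optim.lr_scheduler.LambdaLR`, wrapped around `torch.optim.AdamW(params, lr=3e-5, weight_decay=0.01)`. This pairing is an illustration, not the authors' training code.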
Quotes
"To address these challenges, we introduce the Rotated Multi-Scale Interaction Network (RMSIN), an innovative approach designed for the unique demands of RRSIS."
"RMSIN incorporates an Intra-scale Interaction Module (IIM) to effectively address the fine-grained detail required at multiple scales and a Cross-scale Interaction Module (CIM) for integrating these details coherently across the network."
"Furthermore, RMSIN employs an Adaptive Rotated Convolution (ARC) to account for the diverse orientations of objects, a novel contribution that significantly enhances segmentation accuracy."