The paper presents a novel modality-adaptive Transformer (MAT) for the task of arbitrary modality salient object detection (AM SOD). AM SOD aims to detect salient objects from inputs with arbitrary modalities, such as RGB, depth, and thermal images.
The key components of MAT are:
Modality-Adaptive Feature Extractor (MAFE): MAFE takes an image of arbitrary modality along with a corresponding modality prompt as input. It uses the modality prompt to adaptively adjust the feature space to extract discriminative unimodal features. A novel modality translation contractive (MTC) loss is designed to learn modality-distinguishable prompts.
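The prompt-conditioned extraction and the contrastive objective on prompts can be sketched as follows. This is a minimal NumPy sketch, not the paper's exact formulation: the prompt dimensionality, the FiLM-style scale-and-shift modulation, and the margin value are all illustrative assumptions.

```python
import numpy as np

def extract_features(x, prompt):
    # Hypothetical prompt conditioning: the modality prompt scales and
    # shifts features in a shared space (an assumed modulation scheme).
    scale, shift = prompt[: len(prompt) // 2], prompt[len(prompt) // 2 :]
    return x * (1.0 + scale) + shift

def mtc_loss(prompts, margin=1.0):
    # Sketch of the MTC idea: a contrastive penalty that pushes prompts of
    # different modalities at least `margin` apart, so prompts stay
    # modality-distinguishable.
    loss, n = 0.0, len(prompts)
    for i in range(n):
        for j in range(i + 1, n):
            d = np.linalg.norm(prompts[i] - prompts[j])
            loss += max(0.0, margin - d) ** 2
    return loss / (n * (n - 1) / 2)

# Toy usage: three modality prompts (e.g. RGB, depth, thermal), feature dim 4.
rng = np.random.default_rng(0)
prompts = [rng.standard_normal(8) * 0.01 for _ in range(3)]  # near-identical
feat = extract_features(rng.standard_normal(4), prompts[0])
penalty = mtc_loss(prompts)  # large, since the prompts barely differ
```

Near-identical prompts incur a large penalty, so minimizing this loss during training drives the prompts apart, which is the property MAFE relies on to adapt its feature space per modality.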
Channel-wise and Spatial-wise Fusion Hybrid (CSFH) Strategy: CSFH dynamically fuses the unimodal features from an arbitrary number of modalities. It employs a spatial-wise dynamic fusion module (SDFM) and a channel-wise dynamic fusion module (CDFM) to capture complementary detail and semantic information across modalities, respectively. CSFH aligns SDFM and CDFM to different levels of unimodal features based on their characteristics.
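The two fusion modules can be illustrated with a NumPy sketch that fuses an arbitrary number of modality feature maps. The weighting signals here (channel-mean activation for the spatial path, global average pooling for the channel path) are illustrative stand-ins; the paper's modules learn their fusion weights.

```python
import numpy as np

def softmax(a, axis=0):
    e = np.exp(a - a.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def sdfm(feats):
    # Spatial-wise dynamic fusion (sketch): weight each modality at each
    # spatial location, then sum. feats: list of (C, H, W) arrays, one per
    # modality; the list length is arbitrary.
    stacked = np.stack(feats)                      # (M, C, H, W)
    scores = stacked.mean(axis=1, keepdims=True)   # (M, 1, H, W)
    w = softmax(scores, axis=0)                    # weights across modalities
    return (w * stacked).sum(axis=0)               # (C, H, W)

def cdfm(feats):
    # Channel-wise dynamic fusion (sketch): weight each modality per channel
    # via a global-average-pooled response, then sum.
    stacked = np.stack(feats)                               # (M, C, H, W)
    scores = stacked.mean(axis=(2, 3), keepdims=True)       # (M, C, 1, 1)
    w = softmax(scores, axis=0)
    return (w * stacked).sum(axis=0)                        # (C, H, W)

# Toy usage: the same code path fuses two or three modalities.
rng = np.random.default_rng(0)
feats = [rng.standard_normal((8, 4, 4)) for _ in range(3)]
low = sdfm(feats)   # detail-oriented fusion, suited to low-level features
high = cdfm(feats)  # semantic-oriented fusion, suited to high-level features
```

The per-location weights of `sdfm` preserve spatial detail, while the per-channel weights of `cdfm` select among modality-level semantics, mirroring how CSFH assigns the two modules to different feature levels.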
The proposed MAT effectively addresses the two core challenges of AM SOD, namely the diverse discrepancies among modalities and the need to fuse a varying number of modalities dynamically, and achieves significant performance improvements over existing models on the AM-XD benchmark dataset.
Key insights distilled from: by Nianchang Hu... on arxiv.org, 05-07-2024
https://arxiv.org/pdf/2405.03351.pdf