Core Concepts
The authors propose the DAMS-DETR model to address challenges in infrared-visible object detection by dynamically selecting features from the dominant modality and adaptively fusing complementary information.
Abstract
The DAMS-DETR model addresses modality interference and misalignment in infrared-visible object detection with two components: a Modality Competitive Query Selection strategy and a Multispectral Deformable Cross-attention module. The method outperforms state-of-the-art models on four public datasets, demonstrating its effectiveness in complex scenes and under misalignment.
The paper motivates infrared-visible object detection by the ability of infrared imaging to capture objects under challenging conditions such as low illumination or smoke, where visible imaging degrades. It introduces DAMS-DETR, a DETR-based model that aims to fuse complementary information from infrared and visible images effectively.
Key components of DAMS-DETR are Modality Competitive Query Selection, which lets candidate queries from both modalities compete for selection so that the model dynamically follows the dominant modality, and a Multispectral Deformable Cross-attention module for adaptive feature fusion; a minimal code sketch of both appears below. Experiments on four public datasets show significant improvements over existing methods.
The study highlights the challenges of modality interference and misalignment in infrared-visible object detection, emphasizing the need for adaptive strategies like those proposed in DAMS-DETR. The model's performance is evaluated across multiple scenarios, showcasing its robustness and efficiency.
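Both components lend themselves to a compact illustration. Below is a minimal, self-contained PyTorch sketch of the two ideas as summarized above, not the authors' released implementation: the class names, the `score_head`, the single-feature-level sampling, and the point count are illustrative assumptions, and the paper's actual modules are multi-scale and more elaborate.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModalityCompetitiveQuerySelection(nn.Module):
    """Sketch of competitive query selection: tokens from both modality
    encoders are scored jointly and compete for one fixed query budget,
    rather than splitting the budget per modality in advance."""
    def __init__(self, dim: int, num_queries: int = 300):
        super().__init__()
        self.num_queries = num_queries
        self.score_head = nn.Linear(dim, 1)  # hypothetical per-token objectness score

    def forward(self, ir_tokens: torch.Tensor, vis_tokens: torch.Tensor) -> torch.Tensor:
        # ir_tokens, vis_tokens: (B, N, dim) encoder outputs; requires 2N >= num_queries
        tokens = torch.cat([ir_tokens, vis_tokens], dim=1)          # (B, 2N, dim)
        scores = self.score_head(tokens).squeeze(-1)                # (B, 2N)
        top = scores.topk(self.num_queries, dim=1).indices          # competitive top-k
        idx = top.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        return tokens.gather(1, idx)                                # (B, num_queries, dim)

class MultispectralCrossAttention(nn.Module):
    """Simplified stand-in for multispectral deformable cross-attention:
    each query predicts sampling offsets into BOTH modality feature maps
    plus per-sample fusion weights, so small infrared/visible misalignment
    can be absorbed by the learned offsets."""
    def __init__(self, dim: int, num_points: int = 4):
        super().__init__()
        self.num_points = num_points
        self.offsets = nn.Linear(dim, 2 * num_points * 2)  # 2 modalities x P points x (x, y)
        self.weights = nn.Linear(dim, 2 * num_points)      # fusion weight per sampled point
        self.proj = nn.Linear(dim, dim)

    def forward(self, queries, ref_points, ir_map, vis_map):
        # queries: (B, Q, dim); ref_points: (B, Q, 2) in grid_sample's [-1, 1] range
        # ir_map, vis_map: (B, dim, H, W) feature maps from the two modalities
        B, Q, _ = queries.shape
        offs = self.offsets(queries).view(B, Q, 2, self.num_points, 2)
        w = self.weights(queries).softmax(dim=-1)           # (B, Q, 2P), sums to 1
        sampled = []
        for m, fmap in enumerate((ir_map, vis_map)):
            loc = (ref_points.unsqueeze(2) + offs[:, :, m]).clamp(-1, 1)   # (B, Q, P, 2)
            sampled.append(F.grid_sample(fmap, loc, align_corners=False))  # (B, dim, Q, P)
        feats = torch.cat(sampled, dim=-1)                  # (B, dim, Q, 2P)
        fused = (feats * w.unsqueeze(1)).sum(dim=-1)        # weighted sum over all samples
        return self.proj(fused.transpose(1, 2))             # (B, Q, dim)
```

As a smoke test under these assumptions: with dim=256, ir_tokens/vis_tokens of shape (2, 400, 256), and feature maps of shape (2, 256, 20, 20), the selector returns (2, 300, 256) queries and the cross-attention returns a fused (2, 300, 256) tensor. The softmaxed per-sample weights are where a dominant modality can receive most of the fusion mass in a given scene.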
Stats
Experiments on four public datasets demonstrate significant improvements over other state-of-the-art methods:
On the M3FD dataset, the method achieves mAP50 80.2%, mAP75 56.0%, and mAP 52.9%.
On the FLIR-aligned dataset, it achieves mAP50 86.6%, mAP75 48.1%, and mAP 49.3%.
On the LLVIP dataset, it achieves mAP50 97.9%, mAP75 79.1%, and mAP 69.6%.
On the VEDAI dataset, it achieves mAP50 91.5% and mAP 55.3%.
Quotes
"The one with the green bounding box has good complementary information in both modalities."
"Some works learn global fusion weight to adapt to specific scenes."
"Our method can adaptively focus on dominant modalities and effectively mine fine-grained multi-level semantic complementary information."