The paper focuses on the task of automated illicit object detection in X-ray images, which is crucial for public safety at critical terminals like airports, train stations, and ports. It evaluates various deep neural architectures, including Transformer-based and hybrid models, on this task.
The key highlights are:
The paper compares SWIN (a pure Transformer backbone), NextViT (a hybrid CNN-Transformer backbone), the DINO Transformer detector, and YOLO-based detectors on the SIXray and CFray datasets.
The results show that the DINO Transformer detector achieves remarkable accuracy in the low-data regime of the CFray dataset.
YOLOv8 combined with the NextViT backbone delivers real-time inference speed while maintaining competitive accuracy.
The hybrid NextViT backbone proves effective, outperforming the pure Transformer SWIN on the SIXray dataset under the stricter mAP@50-95 metric (mean average precision averaged over IoU thresholds from 0.50 to 0.95).
The paper suggests that end-to-end Transformer architectures may face challenges in the highly specific X-ray domain, compared to hybrid solutions that also employ convolutions. However, Transformers can be surprisingly accurate in low-data regimes.
Future research directions may involve combining X-ray-specific neural modules with the evaluated methods to further improve accuracy.
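To make the "stricter" mAP@50-95 criterion concrete, here is a minimal Python sketch of the matching rule behind it. It only illustrates how a single predicted box is judged at each IoU threshold from 0.50 to 0.95; it is not the paper's evaluation code and does not compute the full precision-recall based average precision. The box coordinates and the helper names `iou` and `thresholds_passed` are hypothetical, chosen for illustration.

```python
def iou(box_a, box_b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    ix1 = max(box_a[0], box_b[0])
    iy1 = max(box_a[1], box_b[1])
    ix2 = min(box_a[2], box_b[2])
    iy2 = min(box_a[3], box_b[3])
    # Clamp at zero so non-overlapping boxes give an empty intersection.
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


# The ten IoU thresholds 0.50, 0.55, ..., 0.95 that mAP@50-95 averages over.
THRESHOLDS = [round(0.50 + 0.05 * i, 2) for i in range(10)]


def thresholds_passed(pred, truth):
    """Return the thresholds at which `pred` would count as a match for `truth`."""
    overlap = iou(pred, truth)
    return [t for t in THRESHOLDS if overlap >= t]


# Hypothetical example: the prediction covers 62% of the ground-truth box.
truth = (0, 0, 100, 100)
pred = (0, 0, 100, 62)
passed = thresholds_passed(pred, truth)
print(f"IoU = {iou(pred, truth):.2f}, matched at {len(passed)} of {len(THRESHOLDS)} thresholds")
```

In this sketch the prediction counts as correct under mAP@50 but is rejected at most of the higher thresholds, which is why mAP@50-95 rewards tighter localization than mAP@50 alone.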
Source: arxiv.org