The paper introduces a novel teacher-student model called TSCM for visual place recognition (VPR). The key contributions are:
TSCM employs a cross-metric knowledge distillation (KD) approach that allows the student model to outperform even the teacher model on VPR tasks. This is achieved by aligning the distances between anchor, positive, and negative samples across the teacher and student models, rather than merely aligning their output features.
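The idea can be illustrated with a minimal sketch: instead of forcing the student's embeddings to match the teacher's, the loss penalizes differences in the anchor-positive and anchor-negative *distances* produced by the two models. The function names and the squared-error form of the loss below are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def triplet_distances(embeddings, anchor, positive, negative):
    """Euclidean distances from the anchor to the positive and negative samples."""
    d_pos = np.linalg.norm(embeddings[anchor] - embeddings[positive])
    d_neg = np.linalg.norm(embeddings[anchor] - embeddings[negative])
    return d_pos, d_neg

def cross_metric_kd_loss(teacher_emb, student_emb, anchor, positive, negative):
    """Align the student's triplet distance structure with the teacher's,
    rather than matching raw output features (simplified sketch)."""
    t_pos, t_neg = triplet_distances(teacher_emb, anchor, positive, negative)
    s_pos, s_neg = triplet_distances(student_emb, anchor, positive, negative)
    # Penalize mismatch between the two models' distance structures.
    return (s_pos - t_pos) ** 2 + (s_neg - t_neg) ** 2
```

Because only the metric structure is constrained, the student's embedding space is free to differ from the teacher's in other respects, which is one intuition for why the student can surpass the teacher.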
The teacher model in TSCM integrates powerful components from ResNet, Vision Transformer, and Inter-Transformer to achieve superior VPR performance compared to state-of-the-art baselines.
The student model in TSCM is designed to be lightweight, retaining only essential components, while still matching or exceeding the teacher's performance through the proposed cross-metric KD.
Comprehensive evaluations on the Pittsburgh30k and Pittsburgh250k datasets demonstrate that TSCM outperforms baseline methods in both recognition accuracy and model parameter efficiency. The student model can compress an image into a descriptor in 1.3 ms and find a match in under 0.6 ms per query on a 10k-image database, achieving real-time performance.
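At query time, the matching step reduces to a nearest-neighbour search over the database of descriptors. The brute-force formulation below is an illustrative assumption (the paper does not specify the index structure; large-scale systems typically use approximate nearest-neighbour indexes):

```python
import numpy as np

def match_query(query_desc, database_descs):
    """Return the index of the database descriptor closest to the query,
    using brute-force Euclidean nearest-neighbour search."""
    dists = np.linalg.norm(database_descs - query_desc, axis=1)
    return int(np.argmin(dists))
```

With compact descriptors, a vectorized scan like this over a 10k-entry database is cheap enough to be consistent with the sub-millisecond matching times reported above.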