Core Concepts
A novel method for robust scene change detection that leverages the feature extraction capabilities of a visual foundation model (DINOv2) and integrates full-image cross-attention to effectively handle viewpoint variations between image pairs.
Summary
The authors present a novel approach to scene change detection (SCD) that leverages the robust feature extraction capabilities of a visual foundation model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences.
Key highlights:
- The method freezes the backbone network during training to retain the generality of dense foundation features, enhancing the reliability of change detection.
- The full-image cross-attention mechanism is employed to better tackle the viewpoint variations between image pairs.
- Extensive evaluations are performed on the VL-CMU-CD and PSCD datasets, including newly created viewpoint-varied versions, demonstrating significant improvements in F1-score, particularly in scenarios involving geometric changes.
- Detailed ablation studies validate the contributions of each component in the architecture, including the choice of backbone and the effectiveness of the cross-attention module.
- The results indicate superior generalization over existing state-of-the-art approaches: the method is robust to both photometric and geometric variations, and transfers better when fine-tuned to new environments.
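To make the full-image cross-attention idea concrete, the sketch below shows single-head cross-attention in which every patch token of the query image attends to all patch tokens of the reference image, so a patch can find its correspondence anywhere in the other view. This is an illustrative stand-in, not the paper's exact module: the token shapes, dimensions, and random features standing in for frozen DINOv2 patch tokens are all assumptions.

```python
import numpy as np

def full_image_cross_attention(q_feats, kv_feats):
    """Single-head cross-attention: each token in q_feats attends to ALL
    tokens in kv_feats (illustrative sketch of full-image cross-attention)."""
    d = q_feats.shape[-1]
    scores = q_feats @ kv_feats.T / np.sqrt(d)       # (Nq, Nk) similarity
    scores -= scores.max(axis=-1, keepdims=True)     # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)   # softmax over all Nk
    return weights @ kv_feats                        # (Nq, d) fused features

rng = np.random.default_rng(0)
# Hypothetical stand-ins for frozen-backbone patch tokens of the two images
# (a 16x16 patch grid with 64-dim features; DINOv2's real dims differ).
tokens_t0 = rng.standard_normal((256, 64))  # reference image
tokens_t1 = rng.standard_normal((256, 64))  # query image, shifted viewpoint
fused = full_image_cross_attention(tokens_t1, tokens_t0)
print(fused.shape)  # (256, 64)
```

Because attention spans the whole image rather than a local window, a token displaced by a viewpoint change can still aggregate features from its true counterpart, which is the intuition behind the method's robustness to geometric variation.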
Statistics
The VL-CMU-CD dataset consists of 933 coarsely aligned image pairs in the training set and 429 in the test set.
The PSCD dataset has 11,550 aligned image pairs from 770 panoramic image pairs.
The authors create unaligned datasets from VL-CMU-CD and PSCD by using adjacent neighbor image pairs to simulate real-world viewpoint variations.
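The adjacent-neighbor pairing could be sketched as follows; the frame naming, list-based sequences, and `offset` parameter are hypothetical, chosen only to illustrate how pairing frame i at one time with frame i+1 at the other simulates a viewpoint shift.

```python
def make_unaligned_pairs(seq_t0, seq_t1, offset=1):
    """Pair each frame at time t0 with a neighboring (offset) frame at t1
    instead of its aligned counterpart, simulating viewpoint variation.
    Illustrative sketch, not the authors' exact dataset-construction code."""
    pairs = []
    for i in range(len(seq_t0)):
        j = i + offset
        if 0 <= j < len(seq_t1):  # drop pairs whose neighbor falls off the sequence
            pairs.append((seq_t0[i], seq_t1[j]))
    return pairs

# Hypothetical frame filenames for the two capture times
frames_t0 = [f"t0_{k:03d}.png" for k in range(5)]
frames_t1 = [f"t1_{k:03d}.png" for k in range(5)]
pairs = make_unaligned_pairs(frames_t0, frames_t1)
print(pairs[0])   # ('t0_000.png', 't1_001.png')
print(len(pairs)) # 4
```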
Quotes
"We present a novel method for scene change detection that leverages the robust feature extraction capabilities of a visual foundational model, DINOv2, and integrates full-image cross-attention to address key challenges such as varying lighting, seasonal variations, and viewpoint differences."
"By effectively managing correspondences between image pairs, our method outperformed existing approaches and proved particularly effective in scenarios involving geometric changes."