Core Concepts
XFeat is a novel lightweight convolutional neural network architecture that performs fast and robust local feature extraction, enabling efficient visual correspondence for resource-constrained devices.
Abstract
The paper introduces XFeat, a lightweight and accurate architecture for efficient visual correspondence. The main highlights are:
XFeat is designed as a hardware-agnostic, versatile solution that supports both sparse keypoint-based matching and semi-dense pixel-level matching. This flexibility makes it suitable for a wide range of applications, from visual localization to pose estimation and 3D reconstruction.
To reduce the computational footprint, the authors keep the channel count in the early, high-resolution convolutional layers as small as possible and triple it as the spatial resolution decreases. This redistributes the network's channel budget toward the cheaper low-resolution layers and yields a speedup of up to 5x over existing lightweight deep learning solutions.
XFeat features a minimalist, learnable keypoint detection branch that is fast and suitable for small extractor backbones. This design choice proves effective for visual localization, camera pose estimation, and homography registration tasks.
The paper introduces a novel match refinement module that can recover pixel-level offsets from coarse semi-dense matches, without requiring high-resolution feature maps. This lightweight approach greatly reduces the compute and memory requirements while achieving high accuracy and matching density.
Extensive experiments on relative pose estimation, visual localization, and homography estimation show that XFeat is faster than state-of-the-art deep learning-based local feature methods while maintaining comparable or better accuracy. The authors showcase XFeat running in real time on an inexpensive laptop CPU without specialized hardware optimizations.
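The channel-redistribution highlight above can be made concrete with a rough cost estimate. The sketch below assumes the usual cost model for a k x k convolution, H · W · C_in · C_out · k², and compares two hypothetical backbones on a VGA grayscale input. The channel schedules, the one-convolution-per-stage layout, and the function names are illustrative assumptions, not the configuration used in the paper.

```python
def conv_cost(h, w, c_in, c_out, k=3):
    """Multiply-accumulates of one k x k convolution with an h x w output map."""
    return h * w * c_in * c_out * k * k


def backbone_cost(height, width, channels, in_channels=1):
    """Per-stage costs of a toy backbone: one stride-2 conv per entry in `channels`."""
    costs = []
    h, w, c_in = height, width, in_channels
    for c_out in channels:
        h, w = h // 2, w // 2  # assumption: every stage halves the resolution
        costs.append(conv_cost(h, w, c_in, c_out))
        c_in = c_out
    return costs


# Hypothetical channel schedules (not taken from the paper).
doubling = [16, 32, 64, 128]   # wider first layer, doubled at each stage
tripling = [4, 12, 36, 108]    # narrow first layer, tripled at each stage

cost_doubling = backbone_cost(480, 640, doubling)
cost_tripling = backbone_cost(480, 640, tripling)

for name, costs in [("doubling", cost_doubling), ("tripling", cost_tripling)]:
    total = sum(costs)
    shares = [round(c / total, 2) for c in costs]
    print(f"{name}: {total / 1e6:.1f} MMACs, per-stage share {shares}")
```

With these made-up schedules the tripling variant needs roughly a quarter of the multiply-accumulates while ending at a similar width, because the expensive full-resolution stages stay narrow.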
Stats
The paper reports the following key metrics:
Frames per second (FPS) on a budget-friendly laptop CPU (Intel(R) i5-1135G7 @ 2.40GHz) at VGA resolution (640×480):
XFeat: 27.1 ± 0.33 FPS
XFeat* (semi-dense variant): 19.2 ± 1.12 FPS
Relative pose estimation accuracy (Acc@10°) on the MegaDepth-1500 dataset:
XFeat: 74.9%
XFeat*: 85.1%
Mean Homography Accuracy (MHA) for homography estimation on the HPatches dataset:
Illumination split: 95.0% (@ 3 pixels)
Viewpoint split: 81.1% (@ 5 pixels)
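Both accuracy figures above are thresholded success rates, and a minimal sketch may help in reading them. It assumes that Acc@10° counts image pairs whose larger angular error (rotation or translation direction) is below 10°, and that MHA counts pairs whose mean corner error between the estimated and ground-truth homography is below the pixel threshold. These definitions follow common benchmark conventions and are assumptions here; the function names and toy data are invented for illustration.

```python
import math


def rotation_error_deg(R_gt, R_est):
    """Angle of the relative rotation between two 3x3 rotation matrices."""
    # trace(R_gt^T @ R_est) equals the sum of element-wise products.
    trace = sum(R_gt[i][j] * R_est[i][j] for i in range(3) for j in range(3))
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))


def translation_error_deg(t_gt, t_est):
    """Angle between two translation directions (scale is not compared)."""
    dot = sum(a * b for a, b in zip(t_gt, t_est))
    norm = math.hypot(*t_gt) * math.hypot(*t_est)
    cos_angle = max(-1.0, min(1.0, dot / norm))
    return math.degrees(math.acos(cos_angle))


def warp(H, x, y):
    """Apply a 3x3 homography to a pixel coordinate."""
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return u / w, v / w


def corner_error_px(H_gt, H_est, width, height):
    """Mean distance between the four image corners warped by each homography."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    dists = [math.dist(warp(H_gt, x, y), warp(H_est, x, y)) for x, y in corners]
    return sum(dists) / len(dists)


def accuracy_at(errors, threshold):
    """Fraction of samples whose error is below the threshold."""
    return sum(e < threshold for e in errors) / len(errors)


# Toy data, invented for illustration only.
identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shift = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]  # translates by (3, 4) pixels
pose_errors = [(1.0, 2.0), (5.0, 12.0), (9.0, 9.9), (20.0, 1.0)]  # (rot, trans) in degrees

acc_10 = accuracy_at([max(r, t) for r, t in pose_errors], 10.0)
corner_err = corner_error_px(identity, shift, 640, 480)
mha_3 = accuracy_at([1.0, 2.9, 4.0, 7.5], 3.0)
print(acc_10, corner_err, mha_3)
```

In a real evaluation the per-pair errors would come from poses or homographies estimated from the matches, and a benchmark may use a different corner set or handle the sign ambiguity of the translation direction.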
Quotes
"XFeat is versatile and hardware-independent, surpassing current deep learning-based local features in speed (up to 5x faster) with comparable or better accuracy, proven in pose estimation and visual localization."
"Our new model satisfies a critical need for fast and robust algorithms suitable to resource-limited devices."