Core Concepts
A novel deep neural network architecture, SelfAttentionVO, that combines convolutional, recurrent, and attention modules to accurately estimate visual odometry from monocular video streams for GPS-free drone navigation.
Abstract
The paper presents a novel deep learning architecture, SelfAttentionVO, for real-time monocular visual odometry estimation. The key innovations are:
- Combining a convolutional neural network to extract visual features, a recurrent neural network to model sequential dependencies, and a multi-head attention module to refine the sequential representations (a minimal sketch of this arrangement follows this list).
- Training the model on a combination of the KITTI and Mid-Air datasets, which provide diverse real-world driving footage (KITTI) and synthetic aerial footage (Mid-Air).
- Evaluating the model's performance on the test sets of these datasets and comparing it to the benchmark DeepVO model.
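A minimal PyTorch sketch of this CNN-then-RNN-then-attention arrangement is shown below. The class name, channel widths, layer counts, and 6-DoF pose head are illustrative assumptions, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

class SelfAttentionVOSketch(nn.Module):
    """Illustrative CNN -> LSTM -> multi-head-attention pipeline for
    monocular VO. Sizes and depths are assumptions, not the paper's."""

    def __init__(self, feat_dim=512, hidden_dim=512, num_heads=8):
        super().__init__()
        # CNN extracts visual features from stacked consecutive RGB frames (6 channels)
        self.cnn = nn.Sequential(
            nn.Conv2d(6, 64, kernel_size=7, stride=2, padding=3), nn.ReLU(),
            nn.Conv2d(64, 128, kernel_size=5, stride=2, padding=2), nn.ReLU(),
            nn.Conv2d(128, feat_dim, kernel_size=3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1),  # collapse spatial dims -> (B*T, feat_dim, 1, 1)
        )
        # LSTM models sequential dependencies across the per-pair features
        self.rnn = nn.LSTM(feat_dim, hidden_dim, num_layers=2, batch_first=True)
        # Multi-head self-attention refines the recurrent representations
        self.attn = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        # Regress a 6-DoF relative pose (translation + rotation) per step
        self.head = nn.Linear(hidden_dim, 6)

    def forward(self, x):
        # x: (B, T, 6, H, W) -- T stacked consecutive-frame pairs
        b, t, c, h, w = x.shape
        feats = self.cnn(x.view(b * t, c, h, w)).view(b, t, -1)
        seq, _ = self.rnn(feats)
        refined, _ = self.attn(seq, seq, seq)  # self-attention over time
        return self.head(refined)  # (B, T, 6) relative poses
```

Feeding a clip such as `torch.randn(2, 8, 6, 184, 608)` (a batch of two 8-pair sequences) would yield a `(2, 8, 6)` tensor of relative poses.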
The results show that SelfAttentionVO outperforms DeepVO in several key metrics:
- 22% reduction in mean translational drift
- 40% reduction in mean rotational drift
- 12% improvement in mean translational absolute trajectory error
- 30% improvement in mean rotational absolute trajectory error
Additionally, SelfAttentionVO demonstrates greater robustness to noisy/corrupted input data compared to DeepVO. The paper also presents a real-time inference utility that can process video streams at 15-60 FPS, making it suitable for onboard drone applications.
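A sketch of what such a real-time inference loop might look like with OpenCV follows; the `SelfAttentionVOSketch` model, frame size, and output handling are carried over from the sketch above as assumptions, not the paper's actual utility:

```python
import time
import cv2
import torch

def run_realtime_vo(model, video_source=0, size=(608, 184)):
    """Feed consecutive frame pairs to the model and report FPS.
    Hypothetical utility; the paper's actual tool is not shown here."""
    model.eval()
    cap = cv2.VideoCapture(video_source)
    prev = None
    while cap.isOpened():
        ok, frame = cap.read()
        if not ok:
            break
        start = time.time()
        frame = cv2.resize(frame, size)
        cur = torch.from_numpy(frame).permute(2, 0, 1).float() / 255.0
        if prev is not None:
            # Stack the previous and current frames into a 6-channel input
            pair = torch.cat([prev, cur], dim=0).unsqueeze(0).unsqueeze(0)
            with torch.no_grad():
                pose = model(pair)[0, -1]  # 6-DoF relative pose for this pair
            fps = 1.0 / max(time.time() - start, 1e-6)
            print(f"pose: {pose.tolist()}  ({fps:.1f} FPS)")
        prev = cur
    cap.release()
```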
While the model shows promising results, further research is needed to address challenges such as low-parallax scenes and vertical motion estimation. Strategies such as loop closure and auxiliary-task optimization could further improve the model's overall accuracy.
Stats
- Mean translational drift: 62.7% for SelfAttentionVO vs. 74.7% for DeepVO
- Mean rotational drift: 26.6 deg/100 m for SelfAttentionVO vs. 39.9 deg/100 m for DeepVO
- Mean translational absolute trajectory error (ATE): 60.8 m for SelfAttentionVO vs. 69.1 m for DeepVO (a sketch of a typical ATE computation follows this list)
- Mean rotational ATE: 80.6 deg for SelfAttentionVO vs. 115.1 deg for DeepVO
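For context, translational ATE is commonly computed as the RMSE of per-pose position error between the estimated and ground-truth trajectories. A minimal sketch under that common definition, assuming the trajectories are already aligned (not necessarily the paper's exact evaluation code):

```python
import numpy as np

def translational_ate(est, gt):
    """RMSE of position error between estimated and ground-truth
    trajectories, each an (N, 3) array of positions. Assumes the
    trajectories are already aligned (no Umeyama alignment shown)."""
    est, gt = np.asarray(est), np.asarray(gt)
    errors = np.linalg.norm(est - gt, axis=1)  # per-pose Euclidean error
    return float(np.sqrt(np.mean(errors ** 2)))

# Example: a constant 3 m lateral offset yields an ATE of exactly 3 m.
gt = np.stack([np.arange(100), np.zeros(100), np.zeros(100)], axis=1)
est = gt + np.array([0.0, 3.0, 0.0])
print(translational_ate(est, gt))  # -> 3.0
```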
Quotes
"SelfAttentionVO's performances are better than the benchmark model. Overall, SelfAttentionVO allows for around 22% reduction in mean translational drift (KITTI Translation Error) and 40% reduction in mean rotational drift (Rotation Error) when calculated on complete trajectories (capped at 1,000 metres)."
"Moreover, the translational fit is improved by about 12% (translation ATE) and the rotational fit is improved by about 30% (rotation ATE)."