Core Concepts
The paper presents a self-supervised monocular depth estimation network. It uses large kernel attention to model long-distance dependencies while keeping feature channel adaptivity, and an upsampling module to recover fine details in the depth map. It achieves competitive performance on the KITTI dataset.
Summary
The paper proposes a self-supervised monocular depth estimation network that aims to improve the accuracy and detail recovery of depth maps. The key contributions are:
- A depth decoder based on large kernel attention (LKA) that can model long-distance dependencies without compromising the 2D structure of features, while maintaining feature channel adaptivity. This helps the model capture more accurate context information and process complex scenes better.
- An upsampling module that can accurately recover fine details in the depth map, reducing blurred edges compared to simple bilinear interpolation.
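For orientation, the large kernel attention block that the decoder builds on can be sketched as below. The sketch follows the LKA decomposition from the Visual Attention Network (VAN) work: a depth-wise convolution, a dilated depth-wise convolution, then a 1x1 convolution, with the result used as a multiplicative attention map. It assumes PyTorch. The class name, the 5/7/dilation-3 kernel sizes, and the channel count are illustrative assumptions; this summary does not specify how the paper wires LKA into its depth decoder or how its upsampling module is built.

```python
import torch
import torch.nn as nn


class LargeKernelAttention(nn.Module):
    """Sketch of an LKA block (VAN-style decomposition); not the paper's exact decoder."""

    def __init__(self, channels: int):
        super().__init__()
        # Depth-wise 5x5 convolution: captures local context per channel.
        self.local_conv = nn.Conv2d(
            channels, channels, kernel_size=5, padding=2, groups=channels
        )
        # Depth-wise 7x7 convolution with dilation 3: covers a 19x19 window,
        # giving long-range context; padding 9 keeps the spatial size unchanged.
        self.long_range_conv = nn.Conv2d(
            channels, channels, kernel_size=7, padding=9, dilation=3, groups=channels
        )
        # 1x1 convolution: mixes information across channels, so the
        # attention weights can differ from channel to channel.
        self.channel_conv = nn.Conv2d(channels, channels, kernel_size=1)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        attn = self.local_conv(x)
        attn = self.long_range_conv(attn)
        attn = self.channel_conv(attn)
        # The attention map has the same (N, C, H, W) shape as the input and
        # reweights it element-wise; the 2D layout is never flattened.
        return x * attn
```

The output has the same shape as the input, which is what lets such a block sit inside a convolutional decoder without reshaping features into a token sequence.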
The proposed network is evaluated on the KITTI dataset and achieves competitive results, outperforming various CNN-based and Transformer-based self-supervised monocular depth estimation methods. Qualitative results show the proposed method can generate depth maps with sharper edges and better distinguish boundaries compared to previous approaches.
Ablation studies demonstrate the effectiveness of the LKA decoder and upsampling module, with the full model achieving the best performance in terms of error metrics (AbsRel, SqRel, RMSE, RMSElog) and accuracy metrics (δ1, δ2, δ3). The model also maintains efficiency with no significant increase in parameters or computational cost compared to the baseline.
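The error and accuracy metrics named above have standard definitions in the depth estimation literature. The following is a minimal sketch of those definitions in plain Python, assuming flat lists of positive predicted and ground-truth depths; the function name is hypothetical, and the paper's evaluation protocol (depth cap, cropping, median scaling) is not described in this summary and is not reproduced here.

```python
import math


def depth_metrics(pred, gt):
    """Compute standard depth metrics for equal-length lists of positive depths."""
    if len(pred) != len(gt) or len(gt) == 0:
        raise ValueError("pred and gt must be non-empty and of equal length")
    n = len(gt)
    pairs = list(zip(pred, gt))

    # Mean absolute and squared errors, each normalised by the true depth.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n

    # Root mean squared error in linear and in log space.
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n)

    # Threshold accuracy: fraction of pixels whose ratio max(p/g, g/p)
    # falls below 1.25, 1.25^2 and 1.25^3 (delta1, delta2, delta3).
    ratios = [max(p / g, g / p) for p, g in pairs]
    deltas = {
        f"delta{k}": sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3)
    }

    return {
        "abs_rel": abs_rel,
        "sq_rel": sq_rel,
        "rmse": rmse,
        "rmse_log": rmse_log,
        **deltas,
    }
```

Lower is better for the four error metrics, and higher is better for the three threshold accuracies.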
Statistics
Our method achieves AbsRel = 0.095, SqRel = 0.620, RMSE = 4.148, RMSElog = 0.169, δ1 = 90.7 on the KITTI dataset.
Compared to the Transformer-based methods MonoViT and MonoFormer, our method performs better on all metrics: AbsRel, SqRel, RMSE and RMSElog decrease by 8.7%, 26.7%, 9.4% and 7.7% respectively, and δ1 increases by 1.8%.
Compared to CNN-based methods HR-Depth, DIFFNet and RA-Depth, our model also achieves superior performance.
Compared to BDEdepth, which also uses a grid decoder, our method achieves better performance with almost the same number of parameters.
Compared to MonoVan, which uses VAN as the backbone, our method achieves superior performance with fewer parameters: AbsRel, SqRel, RMSE and RMSElog decrease by 5.9%, 12.2%, 6.1% and 4.0% respectively.
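The percentages in these comparisons are relative changes. As a minimal sketch, assuming each percentage is computed as |baseline - ours| / baseline (the summary states neither the formula nor the baseline values), the baseline implied by each reported figure can be backed out as follows. The function name is hypothetical, and the implied values are derived estimates rather than numbers quoted from the paper.

```python
def implied_baseline(ours: float, percent_change: float, lower_is_better: bool = True) -> float:
    """Back out the baseline implied by a reported relative change.

    Assumes percent_change = 100 * |baseline - ours| / baseline.
    """
    r = percent_change / 100.0
    if not 0.0 <= r < 1.0:
        raise ValueError("percent_change must be in [0, 100)")
    # Error metrics shrink (ours = baseline * (1 - r));
    # accuracy metrics grow (ours = baseline * (1 + r)).
    factor = (1.0 - r) if lower_is_better else (1.0 + r)
    return ours / factor


# Figures reported above for the comparison with the Transformer-based methods:
# (our value, reported percentage change, whether lower is better).
reported = {
    "AbsRel": (0.095, 8.7, True),
    "SqRel": (0.620, 26.7, True),
    "RMSE": (4.148, 9.4, True),
    "RMSElog": (0.169, 7.7, True),
    "delta1": (90.7, 1.8, False),
}

implied = {name: implied_baseline(*args) for name, args in reported.items()}
for name, value in implied.items():
    print(f"{name}: implied baseline is about {value:.3f}")
```

Because the reported percentages are rounded, the implied baselines are only approximate.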
Quotes
"Our method can model long-distance dependencies without compromising the two-dimension structure of features, and improve estimation accuracy, while maintaining feature channel adaptivity."
"We introduce a up-sampling module to accurately recover the fine details in the depth map and improve the accuracy of monocular depth estimation."