The paper proposes a self-supervised monocular depth estimation network that aims to improve the accuracy and detail recovery of depth maps. The key contributions are:
A depth decoder based on large kernel attention (LKA) that models long-range dependencies while preserving the 2D structure of the feature maps and remaining adaptive along the channel dimension. This helps the model capture more accurate contextual information and handle complex scenes better.
An upsampling module that recovers fine details in the depth map more accurately and reduces the blurred edges produced by simple bilinear interpolation.
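To make the efficiency argument behind large kernel attention concrete, the sketch below counts convolution weights. It assumes the decomposition used in the original Visual Attention Network formulation of LKA (a 5x5 depthwise convolution, a 7x7 depthwise convolution with dilation 3, and a 1x1 pointwise convolution); this summary does not confirm that the paper's decoder uses exactly these kernel sizes. The function names and the channel width of 64 are hypothetical, and biases are ignored.

```python
def lka_weight_count(channels, dw_kernel=5, dilated_kernel=7):
    """Weights in an LKA-style decomposition (assumed sizes, biases ignored)."""
    depthwise = channels * dw_kernel ** 2        # one k x k filter per channel
    dilated = channels * dilated_kernel ** 2     # dilation adds no weights
    pointwise = channels ** 2                    # 1x1 conv mixes channels
    return depthwise + dilated + pointwise


def dense_weight_count(channels, kernel=21):
    """Weights in a single dense kernel x kernel convolution, biases ignored."""
    return channels ** 2 * kernel ** 2


channels = 64  # hypothetical feature width, chosen only for illustration
lka = lka_weight_count(channels)
dense = dense_weight_count(channels)
print(f"LKA-style decomposition: {lka} weights")
print(f"Dense 21x21 convolution: {dense} weights")
```

In an LKA block the output of these three convolutions is used as an attention map and multiplied element-wise with the input features, which is how the block adapts per spatial position and per channel while keeping the feature map in its 2D layout.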
The proposed network is evaluated on the KITTI dataset and achieves competitive results, outperforming various CNN-based and Transformer-based self-supervised monocular depth estimation methods. Qualitative results show that the proposed method generates depth maps with sharper edges and distinguishes object boundaries better than previous approaches.
Ablation studies demonstrate the effectiveness of the LKA decoder and the upsampling module: the full model achieves the best performance on both the error metrics (AbsRel, SqRel, RMSE, RMSE log) and the accuracy metrics (δ1, δ2, δ3, the fraction of pixels whose prediction-to-ground-truth ratio is below 1.25, 1.25², and 1.25³). The model also remains efficient, with no significant increase in parameters or computational cost compared to the baseline.
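The metrics named above follow the definitions commonly used for KITTI depth evaluation. The sketch below computes them for paired lists of predicted and ground-truth depths. It is a minimal illustration, not the paper's evaluation code: the function name `depth_metrics` is hypothetical, and steps that evaluation protocols typically add (masking invalid pixels, capping depth, median scaling for self-supervised models) are assumed to have been applied beforehand.

```python
import math


def depth_metrics(pred, gt):
    """Standard depth metrics for paired, strictly positive depth values."""
    pairs = list(zip(pred, gt))
    n = len(pairs)

    # Error metrics: lower is better.
    abs_rel = sum(abs(p - g) / g for p, g in pairs) / n
    sq_rel = sum((p - g) ** 2 / g for p, g in pairs) / n
    rmse = math.sqrt(sum((p - g) ** 2 for p, g in pairs) / n)
    rmse_log = math.sqrt(
        sum((math.log(p) - math.log(g)) ** 2 for p, g in pairs) / n
    )

    # Accuracy metrics: fraction of pixels with max(p/g, g/p) < 1.25**k.
    ratios = [max(p / g, g / p) for p, g in pairs]
    d1, d2, d3 = (sum(r < 1.25 ** k for r in ratios) / n for k in (1, 2, 3))

    return {
        "abs_rel": abs_rel, "sq_rel": sq_rel,
        "rmse": rmse, "rmse_log": rmse_log,
        "delta1": d1, "delta2": d2, "delta3": d3,
    }


# Toy values in metres, made up purely to exercise the function.
result = depth_metrics(pred=[10.0, 21.0, 30.0], gt=[10.0, 20.0, 45.0])
for name, value in result.items():
    print(f"{name}: {value:.4f}")
```

Lower values are better for the four error metrics, while higher values are better for δ1, δ2, and δ3.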
Key ideas extracted from the paper by Xuezhi Xiang... (arxiv.org, 09-27-2024)
https://arxiv.org/pdf/2409.17895.pdf