Efficient Monocular Depth Estimation with Cross-Architecture Knowledge Distillation

Core Concepts
A novel cross-architecture knowledge distillation method, DisDepth, that enhances efficient CNN models with the supervision of state-of-the-art transformer models for monocular depth estimation.
The paper proposes DisDepth, a method for efficient monocular depth estimation (MDE), built from the following key components:

- A simple and efficient CNN-based MDE framework with a backbone encoder and a simple decoder. This framework achieves competitive performance compared to complex transformer-based models.
- A local-global convolution (LG-Conv) module that enhances the global representation capability of the CNN backbone without significantly increasing computational cost.
- A cross-architecture knowledge distillation (KD) approach that adapts the transformer teacher model to be more student-friendly. This is achieved by introducing a "ghost decoder", a copy of the student's decoder that aligns the teacher features with the student's decoding space.
- An attentive KD loss that identifies valuable regions in the teacher features and guides the student to focus on these regions during distillation.

Extensive experiments on the KITTI and NYU Depth V2 datasets demonstrate that DisDepth achieves significant efficiency and performance improvements over existing state-of-the-art MDE methods, especially on resource-constrained devices.
The paper reports the following key metrics: on the KITTI dataset, DisDepth-B0 achieves an RMSE of 2.545 with only 35.7B FLOPs, outperforming the 42.8B-FLOP LightDepth model by 0.377 RMSE. On the NYU Depth V2 dataset, DisDepth-B0 matches state-of-the-art performance while using only 7.5% of the FLOPs of BTS.
"Our LG-Conv is efficient and friendly to deployment, and experiments show that it effectively extracts the global information and boosts the performance."

"To this end, we propose to acclimate the transformer teacher with a ghost decoder, which is a copy of the student's decoder, so that we can obtain adapted teacher features that are more appropriate for distillation."

"Furthermore, instead of directly optimizing the distillation loss between adapted teacher feature and student feature, we introduce an attentive KD loss, which learns valuable regions in the teacher features, and then uses the learned region importances to guide the student to focus more on valuable features."
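To make the attentive KD loss concrete, here is a minimal numpy sketch of the weighting idea. It is illustrative only: the paper learns its region importances, whereas this sketch derives them from teacher feature energy via a softmax, and the function and parameter names are assumptions, not the paper's API.

```python
import numpy as np

def attentive_kd_loss(teacher_feat, student_feat, temperature=1.0):
    """Illustrative attentive KD loss: weight per-region feature error by a
    region-importance score derived from the teacher feature.

    teacher_feat, student_feat: (C, H, W) arrays of matching shape.
    Note: the paper learns the importances; here we approximate them with
    a softmax over the teacher's spatial feature energy for simplicity.
    """
    assert teacher_feat.shape == student_feat.shape
    # Region "importance": softmax over spatial positions of channel-mean energy.
    energy = (teacher_feat ** 2).mean(axis=0)        # (H, W)
    logits = energy.flatten() / temperature
    logits -= logits.max()                           # numerical stability
    weights = np.exp(logits) / np.exp(logits).sum()  # sums to 1
    # Per-position squared error, averaged over channels.
    err = ((teacher_feat - student_feat) ** 2).mean(axis=0).flatten()
    return float((weights * err).sum())
```

With this weighting, regions where the teacher carries little signal contribute proportionally less to the distillation objective, which is the behavior the attentive KD loss is designed to produce.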

Deeper Inquiries

How can the proposed cross-architecture knowledge distillation approach be extended to other computer vision tasks beyond monocular depth estimation?

The proposed cross-architecture knowledge distillation approach can be extended to other computer vision tasks by adapting the same principles: leverage the knowledge learned by a complex, powerful teacher model (such as a transformer) and distill it into a simpler student model (such as a CNN). This applies to tasks like image classification, object detection, semantic segmentation, and image generation.

- Image classification: the teacher could be a state-of-the-art transformer-based model and the student a lightweight CNN. By acclimating the teacher features with a ghost decoder and using attentive knowledge distillation, valuable information can be transferred to the student while maintaining efficiency.
- Object detection: with a complex transformer-based detector as teacher and a simpler CNN-based detector as student, the ghost decoder mechanism can help align the two models' features, leading to better object localization and recognition.
- Semantic segmentation: distilling from a transformer-based segmentation network into a CNN with fewer parameters can improve segmentation accuracy without sacrificing computational efficiency.

Overall, the approach can serve a wide range of computer vision tasks, provided the methodology is adapted to the specific requirements and characteristics of each task.
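For the classification case, the simplest port of the distillation idea is classic soft-label KD in the style of Hinton et al.; a minimal, architecture-agnostic sketch (the temperature value and function names are illustrative assumptions):

```python
import math

def softmax(logits, T=1.0):
    """Temperature-scaled softmax over a list of logits."""
    m = max(l / T for l in logits)                 # subtract max for stability
    exps = [math.exp(l / T - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def kd_loss(teacher_logits, student_logits, T=4.0):
    """Soft-label KD: KL divergence from teacher soft targets to student
    soft predictions, scaled by T^2 as is conventional."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return (T * T) * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
```

The cross-architecture refinements discussed above (ghost decoder, attentive weighting) would sit on top of a feature-level version of this objective rather than replace it.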

What are the potential limitations of the local-global convolution module, and how can it be further improved to capture global information more effectively?

The local-global convolution module, while effective in capturing both local and global information, has some limitations that could be addressed for further improvement:

- Limited receptive field: compared to self-attention in transformers, the module's receptive field is bounded. Larger kernel sizes or multi-scale processing would let it incorporate information from more distant regions.
- Information fusion: the fusion of local and global features may not be optimal for all scenarios. Fine-tuning the fusion mechanism, such as using different aggregation strategies or attention mechanisms, could integrate the two streams better.
- Adaptability: handling varying input sizes and resolutions could be improved by incorporating adaptive pooling or dynamic convolution mechanisms, avoiding loss of information across input dimensions.
- Complexity: the additional computational overhead the module introduces may impact overall efficiency. Optimization techniques such as pruning redundant connections or parameters can reduce complexity while maintaining performance.

By addressing these limitations, the local-global convolution module can be further refined to capture global information more effectively in computer vision tasks.
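The local-plus-global pattern under discussion can be sketched as follows. This is not the paper's exact LG-Conv (whose details we do not reproduce here); it is a toy numpy sketch, assuming a depthwise local branch and a global branch that broadcasts pooled context back to every position:

```python
import numpy as np

def lg_conv(x, local_kernel):
    """Toy local-global convolution sketch (illustrative, not the paper's
    LG-Conv): depthwise k x k local convolution plus a global branch that
    adds each channel's global average back to every spatial position.

    x: (C, H, W) input; local_kernel: (C, k, k) depthwise weights.
    """
    C, H, W = x.shape
    k = local_kernel.shape[-1]
    pad = k // 2
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    local = np.zeros_like(x)
    for c in range(C):                   # naive depthwise local convolution
        for i in range(H):
            for j in range(W):
                local[c, i, j] = (xp[c, i:i + k, j:j + k] * local_kernel[c]).sum()
    # Global branch: per-channel global average, broadcast to all positions.
    global_ctx = x.mean(axis=(1, 2), keepdims=True)
    return local + global_ctx            # simple additive fusion
```

The "limited receptive field" and "information fusion" points above map directly onto this sketch: the local branch sees only a k x k window, and the fusion here is a plain sum, both of which are the obvious levers for improvement.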

Can the ghost decoder mechanism be generalized to other types of teacher-student architectures beyond transformers and CNNs, and what are the key considerations in applying it to different model combinations?

The ghost decoder mechanism can be generalized to other types of teacher-student architectures beyond transformers and CNNs, with some key considerations:

- Model compatibility: the teacher and student models should have compatible architectures so the ghost decoder can effectively adapt the teacher features for distillation. The relevant components and layers of both models need to align for successful knowledge transfer.
- Feature alignment: the ghost decoder should align the features of the teacher and student in a meaningful way, preserving task-relevant information while adapting to the student model's architecture.
- Loss function design: the distillation loss should be carefully designed to guide the adaptation process. Incorporating task-specific loss components and attentive mechanisms lets the ghost decoder distill the most valuable information.
- Scalability: the mechanism should scale across model sizes and complexities, adapting features from large, complex teachers to smaller, simpler students without sacrificing performance.

With these considerations addressed, the ghost decoder mechanism can be applied to a variety of teacher-student architecture combinations across computer vision tasks for efficient knowledge distillation.
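The architecture-agnostic core of the ghost-decoder idea can be sketched in a few lines. This is a hedged illustration, not the paper's implementation: `student_decoder` stands for any callable decoder, and in practice the ghost copy would share and freeze the student's weights rather than be re-copied per call.

```python
import copy
import numpy as np

def ghost_decoder_distill(teacher_feat, student_feat, student_decoder):
    """Illustrative ghost-decoder distillation: run the (adapted) teacher
    features through a copy of the student's decoder, so teacher and student
    are compared in the student's own decoding space.

    student_decoder: any callable mapping a feature array to a prediction
    array (a stand-in for the real decoder; names here are illustrative).
    """
    ghost = copy.deepcopy(student_decoder)       # "ghost" copy of the decoder
    t_out = ghost(teacher_feat)                  # teacher feature, student decoding
    s_out = student_decoder(student_feat)
    return float(((t_out - s_out) ** 2).mean())  # simple L2 distillation loss
```

Because only the decoder interface is assumed, the same pattern works for any teacher-student pair whose features can be projected into a shape the student's decoder accepts, which is exactly the "model compatibility" consideration above.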