Key Concepts
Using ViT embeddings improves monocular depth estimation.
Summary
Abstract
Learning-based single image depth estimation relies on shading and contextual cues.
ViT embeddings provide detailed contextual information for depth estimation.
The proposed model achieves state-of-the-art performance on the NYU Depth v2 and KITTI datasets.
Introduction
Single Image Depth Estimation (SIDE) is crucial for various applications.
Metric and relative depth estimation techniques are used.
Learning-based models rely on visual cues for depth prediction.
Data-Driven Approach
Models tend to overfit to the depth distribution of their specific training data.
Training on multiple datasets with varied depth ranges is proposed.
Foundational Models
Pre-trained models like ViT improve generalization and zero-shot transfer.
Comparison with existing works like VPD and TADP.
Proposed Methodology
The CIDE module extracts semantic context from ViT embeddings.
A conditional diffusion model predicts depth, conditioned on this context.
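The conditioning flow described above (frozen ViT embedding → learned projection → conditioning signal fed to an iterative denoiser) can be sketched as a toy NumPy mock. Every name here (`vit_embed`, `condition`, `denoise_step`) is illustrative, and the "ViT" and "denoiser" are crude stand-ins, not the paper's model:

```python
import numpy as np

rng = np.random.default_rng(0)

def vit_embed(image):
    # Stand-in for a frozen ViT encoder: pool the image into a fixed-size
    # vector. (A real CIDE-style module would use pre-trained ViT tokens.)
    flat = image.reshape(-1)
    return flat[:64] if flat.size >= 64 else np.pad(flat, (0, 64 - flat.size))

W = rng.normal(size=(64, 32)) * 0.1  # learned projection (here: random)

def condition(embedding):
    # Project the ViT embedding into the denoiser's conditioning space.
    return np.tanh(embedding @ W)

def denoise_step(noisy_depth, cond, t):
    # Toy "denoiser": nudges the noisy depth map toward a value derived
    # from the conditioning vector, more aggressively at small t.
    bias = cond.mean()
    return noisy_depth + (bias - noisy_depth) / (t + 1)

image = rng.random((16, 16))
depth = rng.normal(size=(16, 16))      # start from pure noise
cond = condition(vit_embed(image))
for t in reversed(range(10)):          # iterative reverse-diffusion loop
    depth = denoise_step(depth, cond, t)
```

The point of the sketch is only the data flow: the image never enters the denoiser directly; it influences the prediction solely through the conditioning vector, which is the role the CIDE module plays.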
Experiments and Results
Evaluation on NYU Depth v2 and KITTI datasets.
Generalization and zero-shot transfer performance.
Ablation study on the effectiveness of contextual information.
Conclusion
CIDE module enhances monocular depth estimation.
ViT embeddings outperform text embeddings for depth prediction.
Statistics
Achieves an Abs Rel error of 0.059 (a 14% improvement) on the NYU Depth v2 dataset.
Achieves a Sq Rel error of 0.139 (a 2% improvement) on the KITTI dataset.
Mean relative improvement over NeWCRF of 20% (Sun-RGBD), 23% (iBims1), 81% (DIODE), and 25% (HyperSim).
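The Abs Rel and Sq Rel figures above are standard monocular-depth error metrics. A minimal NumPy sketch of how they are computed (function names are ours, not the paper's):

```python
import numpy as np

def abs_rel(pred, gt):
    # Absolute relative error: mean(|pred - gt| / gt)
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean(np.abs(pred - gt) / gt))

def sq_rel(pred, gt):
    # Squared relative error: mean((pred - gt)**2 / gt)
    pred, gt = np.asarray(pred, float), np.asarray(gt, float)
    return float(np.mean((pred - gt) ** 2 / gt))

# Example: predicted vs. ground-truth depths in meters
a = abs_rel([2.2, 3.6], [2.0, 4.0])  # ≈ 0.1
s = sq_rel([2.2, 3.6], [2.0, 4.0])   # ≈ 0.03
```

Both metrics normalize the error by the ground-truth depth, so they penalize mistakes on nearby pixels more than equally sized mistakes far away; lower is better.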
Quotations
"ViT embeddings provide more relevant information for depth estimation than pseudo-captions."
"Our model outperforms existing approaches on benchmark datasets."