The paper introduces the SlowTV and CribsTV datasets, curated from YouTube videos, to improve self-supervised monocular depth estimation. These datasets cover a far wider range of environments than standard driving benchmarks, which improves generalization. The accompanying models combine components such as learning the camera intrinsics jointly with depth and stronger augmentation strategies to reach state-of-the-art performance.
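Learning the intrinsics alongside depth means the network predicts the pinhole camera parameters instead of requiring calibrated inputs. A minimal sketch of one plausible parameterization, where the network outputs normalized focal lengths and principal point that are scaled by the image size (the function name and this exact parameterization are illustrative assumptions, not taken from the paper):

```python
def build_intrinsics(fx_n, fy_n, cx_n, cy_n, width, height):
    """Assemble a pinhole intrinsics matrix K from normalized,
    network-predicted parameters (hypothetical parameterization).

    fx_n, fy_n: normalized focal lengths (e.g. squashed to be positive).
    cx_n, cy_n: normalized principal point, typically near 0.5.
    """
    # Scale the normalized predictions by the image dimensions.
    fx, fy = fx_n * width, fy_n * height
    cx, cy = cx_n * width, cy_n * height
    return [[fx,  0.0, cx],
            [0.0, fy,  cy],
            [0.0, 0.0, 1.0]]
```

Predicting values in a normalized range and scaling by image size keeps the estimate resolution-independent, which matters when training on uncalibrated YouTube footage of mixed resolutions.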
Existing self-supervised monocular depth estimation models are typically trained on narrow, automotive-centric data, which limits their generalization beyond that domain. The proposed datasets address this limitation by providing a wide range of scenes, including natural, urban, and indoor environments. Models trained on them achieve strong zero-shot generalization to unseen datasets.
The study showcases the importance of diverse training data in enhancing the performance of self-supervised computer vision systems. By leveraging publicly available video content, the research demonstrates significant advancements in monocular depth estimation without relying on ground-truth annotations.
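The signal that replaces ground-truth annotations in this family of methods is a photometric reprojection loss: pixels from a neighbouring frame are warped into the target view using the predicted depth and camera motion, and the reconstruction error supervises the network. A deliberately simplified 1-D toy sketch of the idea (pure translation along the image axis, nearest-neighbour sampling; all names and the 1-D setup are illustrative, not the paper's implementation):

```python
def reproject(u, depth, fx, cx, tx):
    # Back-project pixel u to a 3-D point (1-D toy: x-axis only),
    # shift by the camera translation tx, and project back to pixels.
    x = (u - cx) * depth / fx
    return fx * (x - tx) / depth + cx

def photometric_loss(target, source, depths, fx, cx, tx):
    # Mean absolute photometric error between the target frame and the
    # source frame sampled at the reprojected coordinates
    # (nearest-neighbour sampling, coordinates clipped to the image).
    total = 0.0
    for u, (t_val, d) in enumerate(zip(target, depths)):
        u_src = round(reproject(u, d, fx, cx, tx))
        u_src = min(max(u_src, 0), len(source) - 1)
        total += abs(t_val - source[u_src])
    return total / len(target)
```

Correct depths make the warped source match the target, so the loss is minimized, which is why no depth labels are needed; real implementations use 2-D bilinear sampling and robust losses such as SSIM rather than this toy L1 error.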
Key insights distilled from the paper by Jaime Spence... at arxiv.org, 03-05-2024
https://arxiv.org/pdf/2403.01569.pdf