Key Idea
Leveraging language priors in a variational framework to improve metric-scale monocular depth estimation.
Abstract
The paper proposes WorDepth, a variational framework that exploits the complementary strengths of two inherently ambiguous modalities, camera images and text descriptions, for monocular depth estimation.
Key highlights:
- Encodes text captions as a mean and standard deviation to learn the distribution of plausible metric reconstructions of 3D scenes.
- Introduces an image-based conditional sampler that uses language as a conditional prior, "selecting" the most probable depth map from the learned latent distribution.
- Achieves state-of-the-art performance on indoor (NYU Depth V2) and outdoor (KITTI) depth estimation benchmarks.
- Demonstrates that language can consistently improve depth estimation performance by providing priors about object sizes and scene layouts.
- Explores the sensitivity of the model to different ratios of alternating optimization between the text-VAE and conditional sampler.
- Shows improved zero-shot generalization to unseen datasets by leveraging the transferability of language priors.
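The sampling mechanism in the highlights above can be sketched with a minimal numpy example. This is a hypothetical illustration, not the authors' implementation: the tiny `text_encoder` and `conditional_sampler` functions and their feature dimensions are stand-ins, but the structure follows the paper's description, where text yields a mean and standard deviation and the image-conditioned sampler replaces random noise in the reparameterization step.

```python
import numpy as np

rng = np.random.default_rng(0)

def text_encoder(text_feat):
    # Hypothetical stand-in: maps a caption embedding to the mean and
    # log-std of a latent distribution over plausible metric-scale
    # reconstructions of the scene.
    mu = 0.1 * text_feat
    log_sigma = -np.abs(0.05 * text_feat)
    return mu, log_sigma

def conditional_sampler(image_feat, dim):
    # Hypothetical image-conditioned sampler: predicts the noise vector
    # used in reparameterization, "selecting" one depth hypothesis from
    # the text-derived distribution instead of drawing eps at random.
    return np.tanh(image_feat[:dim])

def reparameterize(mu, log_sigma, eps):
    # Standard reparameterization trick: z = mu + sigma * eps
    return mu + np.exp(log_sigma) * eps

text_feat = rng.normal(size=8)   # stand-in for a caption embedding
image_feat = rng.normal(size=8)  # stand-in for an image embedding

mu, log_sigma = text_encoder(text_feat)
eps = conditional_sampler(image_feat, mu.shape[0])
z = reparameterize(mu, log_sigma, eps)  # latent fed to a depth decoder
```

During the VAE training phase, `eps` would instead be sampled from a standard normal; alternating between that objective and the conditional sampler is the optimization schedule the paper ablates.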
Statistics
The main text does not report standalone numerical statistics; the key results appear as quantitative evaluation metrics on the benchmark datasets.