The paper presents Marigold, a method for affine-invariant monocular depth estimation that is derived from the Stable Diffusion text-to-image diffusion model. The key idea is to leverage the extensive visual priors captured in recent generative diffusion models to enable better and more generalizable depth estimation.
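"Affine-invariant" here means the predicted depth is only defined up to an unknown per-image scale and shift; comparing it against metric ground truth requires a least-squares alignment first. The snippet below is a minimal sketch of that alignment (the function name and toy arrays are illustrative, not from the paper):

```python
import numpy as np

def align_affine_invariant(pred, gt):
    """Solve min over (s, t) of ||s * pred + t - gt||^2, then apply it.
    Affine-invariant depth is only meaningful after this per-image fit."""
    A = np.stack([pred.ravel(), np.ones(pred.size)], axis=1)
    s, t = np.linalg.lstsq(A, gt.ravel(), rcond=None)[0]
    return s * pred + t

# A prediction that differs from ground truth only by scale and shift
gt = np.array([[1.0, 2.0], [3.0, 4.0]])
pred = 0.5 * gt + 0.25
aligned = align_affine_invariant(pred, gt)
```

After alignment, `aligned` matches `gt` exactly, since the two differ only by an affine transform.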
The authors first formulate monocular depth estimation as a conditional denoising diffusion generation task, where the goal is to model the conditional distribution of depth given an input image. They then introduce a fine-tuning protocol to adapt the pre-trained Stable Diffusion model for this task, by modifying and fine-tuning only the denoising U-Net component while keeping the latent space and VAE encoder-decoder intact.
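Concretely, casting depth estimation as conditional denoising means the depth latent is progressively noised during training while the image latent stays clean, and the U-Net learns to predict the added noise from both. Below is a toy numpy sketch of one such training step under standard DDPM assumptions; the latents, schedule values, and the zero placeholder for the network output are all illustrative:

```python
import numpy as np

rng = np.random.default_rng(0)

def forward_noise(z0, t, alpha_bar):
    """DDPM forward process: z_t = sqrt(abar_t) * z0 + sqrt(1 - abar_t) * eps."""
    eps = rng.standard_normal(z0.shape)
    zt = np.sqrt(alpha_bar[t]) * z0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return zt, eps

# Toy latents standing in for VAE encodings of the image and the depth map
z_image = rng.standard_normal((4, 4))   # conditioning latent (kept clean)
z_depth = rng.standard_normal((4, 4))   # target latent (noised during training)

# A generic linear beta schedule (illustrative values)
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

t = 500
z_t, eps = forward_noise(z_depth, t, alpha_bar)

# The U-Net would receive the image latent and noisy depth latent together
# and be trained to predict eps; a zero placeholder stands in for its output.
unet_input = np.concatenate([z_image[None], z_t[None]], axis=0)
eps_pred = np.zeros_like(eps)
loss = np.mean((eps_pred - eps) ** 2)
```

The key point the sketch illustrates is that only the depth latent is noised; the image latent acts purely as conditioning, which is what lets the pre-trained U-Net be adapted with minimal architectural change.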
The proposed method, Marigold, is trained exclusively on synthetic depth datasets, which provide clean and dense ground truth depth maps. Despite this, the model exhibits excellent zero-shot generalization to a wide range of real-world datasets, outperforming several state-of-the-art monocular depth estimation methods. The authors attribute this to the rich visual priors captured in the pre-trained diffusion model.
The paper also presents several ablation studies, investigating the impact of training noise, dataset domain, test-time ensembling, and the number of denoising steps. The results demonstrate the effectiveness of the proposed fine-tuning approach and the importance of the underlying diffusion-based visual priors.
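Because inference starts from random noise, repeated runs yield slightly different affine-invariant predictions, which is what makes test-time ensembling possible. The sketch below shows a simplified version of the idea: align each sample to a reference via least-squares scale/shift, then fuse with a per-pixel median (the paper's actual aggregation is more elaborate; the helper name and toy data here are assumptions):

```python
import numpy as np

def ensemble_depth(preds):
    """Align each affine-invariant sample to the first via least-squares
    scale/shift, then fuse with a per-pixel median."""
    ref = preds[0]
    aligned = []
    for p in preds:
        A = np.stack([p.ravel(), np.ones(p.size)], axis=1)
        s, t = np.linalg.lstsq(A, ref.ravel(), rcond=None)[0]
        aligned.append(s * p + t)
    return np.median(np.stack(aligned), axis=0)

# Three noisy samples of one underlying depth map, each under its own
# affine transform (standing in for separate diffusion runs)
rng = np.random.default_rng(1)
base = np.linspace(1.0, 5.0, 16).reshape(4, 4)
preds = [a * base + b + 0.01 * rng.standard_normal((4, 4))
         for a, b in [(1.0, 0.0), (2.0, 1.0), (0.5, -0.3)]]
fused = ensemble_depth(preds)
```

The median makes the fused estimate robust to an occasional outlier sample, at the cost of running the denoising process several times.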
Source: Bingxin Ke, A... at arxiv.org, 04-04-2024
https://arxiv.org/pdf/2312.02145.pdf