Core Concepts
Depth inpainting is crucial for generating geometrically consistent 3D scenes from a single image or text prompt.
Abstract
The paper introduces two key contributions to the field of 3D scene generation:
A novel depth completion model that learns to predict depth maps conditioned on the existing scene geometry, resulting in improved geometric coherence of the generated scenes. This model is trained in a self-supervised manner using teacher distillation and self-training.
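The two-part training signal (match known geometry where the scene is observed, distill from a teacher depth estimator elsewhere) can be sketched as a masked loss. This is a minimal illustration, not the paper's actual code; all function and argument names are assumptions.

```python
import numpy as np

def depth_completion_loss(pred, known_depth, known_mask, teacher_depth):
    """Hedged sketch of the training signal: supervise observed pixels with
    the existing scene depth, and unobserved (inpainted) pixels with
    pseudo-labels from a teacher monocular depth model."""
    known = known_mask.astype(bool)
    unknown = ~known
    # Observed regions: stay consistent with the existing scene geometry.
    loss_known = np.abs(pred[known] - known_depth[known]).mean() if known.any() else 0.0
    # Unobserved regions: distill the teacher's monocular prediction.
    loss_teacher = np.abs(pred[unknown] - teacher_depth[unknown]).mean() if unknown.any() else 0.0
    return loss_known + loss_teacher
```

The self-training stage described in the paper would then refine the model on its own fused predictions; this sketch only shows the per-pixel supervision split.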
A new benchmarking scheme for evaluating the geometric quality of scene generation methods, based on ground truth depth data. This allows assessing the consistency and accuracy of the generated 3D structure, going beyond visual quality metrics.
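A geometric-quality benchmark of this kind reduces to comparing generated depth against ground-truth depth on valid pixels. The metric below (absolute relative error) is a standard depth-evaluation measure used here for illustration; the paper's exact protocol may differ.

```python
import numpy as np

def abs_rel_error(pred_depth, gt_depth, valid_mask):
    """Illustrative geometric-fidelity metric: mean absolute relative error
    between predicted and ground-truth depth over valid pixels."""
    valid = valid_mask.astype(bool) & (gt_depth > 0)
    return np.mean(np.abs(pred_depth[valid] - gt_depth[valid]) / gt_depth[valid])
```

Such a metric scores the generated 3D structure directly, which purely visual metrics (e.g. FID-style image scores) cannot.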
The authors show that existing scene generation methods suffer from geometric inconsistencies, which are uncovered by the proposed benchmark. Their depth inpainting model significantly outperforms prior approaches in terms of geometric fidelity, while also maintaining high visual quality.
The pipeline first uses a generative model such as Stable Diffusion to hallucinate new scene content beyond the initial input. It then applies the depth inpainting model to predict depth maps consistent with the existing scene geometry, seamlessly integrating the new content. Additional support views are generated to further constrain the scene and fill in occluded regions. Finally, the point cloud representation is refined via Gaussian splatting optimization into a smooth, final 360-degree scene.
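One outpaint-then-lift step of the pipeline can be sketched with stand-in functions. Everything below is a hypothetical illustration: the real system would call a diffusion model, the trained depth completion network, and a proper camera unprojection.

```python
import numpy as np

rng = np.random.default_rng(0)

def hallucinate_rgb(h, w):
    """Stand-in for a generative model (e.g. Stable Diffusion) outpainting
    new scene content; here it just returns random RGB values."""
    return rng.random((h, w, 3))

def inpaint_depth(rgb, known_depth, known_mask):
    """Stand-in for the depth completion model: keep known depth where the
    scene is already observed, fill the rest (here with the known mean)."""
    fill = known_depth[known_mask].mean() if known_mask.any() else 1.0
    return np.where(known_mask, known_depth, fill)

def unproject(rgb, depth):
    """Stand-in for lifting pixels to 3D points (orthographic for brevity)."""
    h, w, _ = rgb.shape
    ys, xs = np.mgrid[0:h, 0:w]
    return np.stack([xs, ys, depth], axis=-1).reshape(-1, 3)

# One outpainting step: left half of the frame is already observed.
h, w = 4, 6
known_depth = np.full((h, w), 2.0)
known_mask = np.zeros((h, w), dtype=bool)
known_mask[:, : w // 2] = True

rgb = hallucinate_rgb(h, w)
depth = inpaint_depth(rgb, known_depth, known_mask)
points = unproject(rgb, depth)  # these points would seed the splat optimization
```

The key property the real model provides, and this stub only imitates, is that the filled-in depth agrees with the known geometry at the seam, which is what the proposed benchmark measures.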
Stats
The paper does not report specific numerical results here; it focuses on qualitative comparisons and the proposed benchmark.
Quotes
"We note that lifting images to 3D with a monocular depth estimation model is suboptimal as it ignores the geometry of the existing scene."
"We thus introduce a novel depth completion model, trained via teacher distillation and self-training to learn the 3D fusion process, resulting in improved geometric coherence of the scene."