Core Concepts
The proposed method generates 3D scenes by integrating partial images, layout information represented in the top view, and text prompts as complementary input conditions, addressing the limitations of existing methods that rely on a single condition.
Abstract
The proposed method, called MaGRITTe, generates 3D scenes by integrating partial images, layout information represented in the top view, and text prompts as input conditions. The conditions complement one another and compensate for each other's shortcomings, making it easier to create the 3D scene the creator intends.
The method comprises four steps:
1. Conversion of partial images and layouts to equirectangular projection (ERP) format, which allows the different modalities to be integrated (see the ERP-conversion sketch after this list).
2. Generation of a 360-degree RGB image by fine-tuning a pre-trained text-to-image model on a small artificial dataset of partial images and layouts.
3. Estimation of a fine depth map from the generated 360-degree RGB image and the coarse depth map derived from the layout information, using either an end-to-end approach or a depth-integration approach (see the depth-alignment sketch after this list).
4. Training of a NeRF model on the generated 360-degree RGB-D image (the back-projection sketch after this list illustrates the geometry).
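As a concrete illustration of step 1, the sketch below pastes a single perspective partial image onto an ERP canvas by mapping each ERP pixel's viewing direction back into the perspective camera. The function name `partial_to_erp`, the default field of view, and the fixed forward-facing camera are illustrative assumptions rather than the paper's implementation; layout maps can be rasterized onto the same canvas analogously.

```python
import numpy as np

def partial_to_erp(partial, fov_deg=90.0, erp_h=512, erp_w=1024):
    """Paste a perspective partial image (H, W, C) onto an ERP canvas (sketch).

    Assumes the camera looks along +z (longitude 0, latitude 0) with the
    given horizontal field of view; these are illustrative assumptions.
    """
    h, w = partial.shape[:2]
    f = 0.5 * w / np.tan(np.radians(fov_deg) / 2.0)  # focal length in pixels

    # Longitude/latitude of every ERP pixel centre.
    lon = (np.arange(erp_w) + 0.5) / erp_w * 2 * np.pi - np.pi   # [-pi, pi)
    lat = np.pi / 2 - (np.arange(erp_h) + 0.5) / erp_h * np.pi   # [pi/2, -pi/2)
    lon, lat = np.meshgrid(lon, lat)

    # Unit viewing direction for each ERP pixel.
    x = np.cos(lat) * np.sin(lon)
    y = np.sin(lat)
    z = np.cos(lat) * np.cos(lon)

    # Project onto the perspective image plane (only directions with z > 0).
    u = f * x / np.clip(z, 1e-6, None) + w / 2
    v = -f * y / np.clip(z, 1e-6, None) + h / 2
    valid = (z > 0) & (u >= 0) & (u < w) & (v >= 0) & (v < h)

    erp = np.zeros((erp_h, erp_w, partial.shape[2]), dtype=partial.dtype)
    erp[valid] = partial[v[valid].astype(int), u[valid].astype(int)]
    return erp, valid  # `valid` marks the observed region; the rest is generated
```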
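For step 3, one simple way to reconcile a monocular (relative) fine depth estimate with the coarse metric depth derived from the layout is a least-squares scale-and-shift fit, shown below. This generic alignment stands in for the paper's depth-integration objective, which may differ; `integrate_depth` and its arguments are assumed names.

```python
import numpy as np

def integrate_depth(fine_rel, coarse, mask):
    """Align a relative fine depth map to a coarse metric depth map (sketch).

    fine_rel : (H, W) relative depth from a monocular estimator
    coarse   : (H, W) coarse metric depth derived from the layout
    mask     : (H, W) bool, True where the coarse depth is defined

    Solves min_{s, t} || s * fine_rel + t - coarse ||^2 over the masked
    pixels, then applies (s, t) everywhere. This is a generic scale-and-shift
    alignment, not necessarily the objective used in the paper.
    """
    x = fine_rel[mask]
    y = coarse[mask]
    A = np.stack([x, np.ones_like(x)], axis=1)
    (s, t), *_ = np.linalg.lstsq(A, y, rcond=None)
    return s * fine_rel + t
```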
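For step 4, the generated ERP RGB-D image can be back-projected into a coloured point cloud, from which rays and colours provide supervision for NeRF training. The sketch below covers only this back-projection geometry under the same ERP convention as above; the NeRF training loop itself is omitted, and `erp_rgbd_to_points` is an assumed helper name.

```python
import numpy as np

def erp_rgbd_to_points(rgb, depth):
    """Back-project an ERP RGB-D image into a coloured 3D point cloud (sketch).

    rgb   : (H, W, 3) 360-degree RGB image
    depth : (H, W) metric depth per ERP pixel
    """
    h, w = depth.shape
    lon = (np.arange(w) + 0.5) / w * 2 * np.pi - np.pi
    lat = np.pi / 2 - (np.arange(h) + 0.5) / h * np.pi
    lon, lat = np.meshgrid(lon, lat)

    # Unit ray directions, same convention as the ERP-conversion sketch.
    dirs = np.stack([np.cos(lat) * np.sin(lon),
                     np.sin(lat),
                     np.cos(lat) * np.cos(lon)], axis=-1)

    points = dirs * depth[..., None]   # scale unit rays by metric depth
    colors = rgb.reshape(-1, 3)
    return points.reshape(-1, 3), colors
```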
The proposed method avoids the need to create large datasets by fine-tuning a pre-trained text-to-image model and by generating 3D scenes from 2D images. It also addresses the integration of different modalities by converting the input information into a common ERP format and embedding it in the same latent space.
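One common way to realise this kind of conditioning, sketched below under stated assumptions, is to encode the ERP-format partial image and coarse depth and concatenate the result with the noisy latent along the channel dimension before the denoising U-Net. `ConditionedDenoiser`, `base_unet`, and `cond_encoder` are hypothetical names and interfaces, not the paper's actual architecture.

```python
import torch
import torch.nn as nn

class ConditionedDenoiser(nn.Module):
    """Minimal sketch of conditioning a latent text-to-image model on ERP inputs.

    The partial image and the layout-derived coarse depth, both already in ERP
    format, are encoded and concatenated with the noisy latent along the channel
    dimension. `base_unet` stands in for a pre-trained text-to-image U-Net whose
    first convolution has been widened to accept the extra channels; this is an
    assumed interface, not the paper's code.
    """
    def __init__(self, base_unet: nn.Module, cond_encoder: nn.Module):
        super().__init__()
        self.base_unet = base_unet
        self.cond_encoder = cond_encoder  # maps ERP conditions to latent-sized maps

    def forward(self, noisy_latent, timestep, text_emb, erp_partial, erp_depth):
        cond = self.cond_encoder(torch.cat([erp_partial, erp_depth], dim=1))
        x = torch.cat([noisy_latent, cond], dim=1)    # fuse in the same latent space
        return self.base_unet(x, timestep, text_emb)  # predicts the noise residual
```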
Experimental results on indoor and outdoor scene datasets demonstrate that the proposed method can generate 3D scenes whose appearance, geometry, and overall context are controlled by the input conditions, even for conditions beyond the dataset used for fine-tuning.