The proposed method, called MaGRITTe, generates 3D scenes from three input conditions: partial images, layout information represented in the top view, and text prompts. Each condition compensates for the shortcomings of the others, making it easier to create the 3D scene the creator intends.
The method comprises four steps:
The proposed method avoids the need to create large datasets by fine-tuning a pre-trained text-to-image model and generating 3D scenes from 2D images. It also addresses the integration of different modalities by converting the input information into a common equirectangular projection (ERP) format and embedding it in the same latent space.
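To make the idea of a common ERP format more concrete, the following minimal sketch shows how a 3D viewing direction can be mapped to ERP pixel coordinates and back. It is an illustration only, not the authors' implementation: the function names `direction_to_erp` and `erp_to_direction`, the axis convention (y up, z forward), and the choice to place longitude 0 at the image centre are all assumptions made for this example.

```python
import math


def direction_to_erp(x, y, z, width, height):
    """Map a 3D direction to ERP pixel coordinates (u, v).

    Assumed convention (hypothetical, not taken from the paper):
    y points up, z points forward, and the forward direction lands
    at the centre of the image.
    """
    norm = math.sqrt(x * x + y * y + z * z)
    lon = math.atan2(x, z)                            # longitude in [-pi, pi]
    lat = math.asin(max(-1.0, min(1.0, y / norm)))    # latitude in [-pi/2, pi/2]
    u = (lon / (2.0 * math.pi) + 0.5) * width         # left edge is lon = -pi
    v = (0.5 - lat / math.pi) * height                # top edge is lat = +pi/2
    return u, v


def erp_to_direction(u, v, width, height):
    """Inverse mapping: ERP pixel coordinates to a unit direction vector."""
    lon = (u / width - 0.5) * 2.0 * math.pi
    lat = (0.5 - v / height) * math.pi
    return (
        math.cos(lat) * math.sin(lon),
        math.sin(lat),
        math.cos(lat) * math.cos(lon),
    )


if __name__ == "__main__":
    # The forward direction falls on the centre of a 1024 x 512 ERP image.
    print(direction_to_erp(0.0, 0.0, 1.0, 1024, 512))
```

Under these assumptions, a partial perspective image or a top-view layout could each be resampled onto the same ERP pixel grid by computing, for every ERP pixel, the direction returned by `erp_to_direction` and looking up the corresponding value in the source condition.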
Experimental results on indoor and outdoor scene datasets demonstrate that the proposed method can generate 3D scenes with controlled appearance, geometry, and overall context based on the input conditions, even beyond the dataset used for fine-tuning.