Basic Concepts
Vista3D is a framework that efficiently generates diverse and consistent 3D objects from a single input image by leveraging a coarse-to-fine approach and an angular-based composition of diffusion priors.
Summary
The paper presents Vista3D, a framework for generating 3D objects from a single input image. The key aspects of the framework are:
Coarse Geometry Generation:
- Vista3D starts with a coarse geometry generation phase using 3D Gaussian Splatting (3DGS). It employs a Top-K gradient-based densification strategy and introduces scale and transmittance regularization to enhance the reconstructed geometry.
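A short sketch of Top-K densification follows. As in standard 3DGS, it assumes that each Gaussian accumulates a view-space positional gradient over the training views. The rule shown here, which clones exactly the K Gaussians with the largest average gradient, and the names `topk_densify_mask` and `clone_selected` are hypothetical illustrations, not Vista3D's actual implementation. The scale and transmittance regularizers are not shown.

```python
import numpy as np


def topk_densify_mask(grad_accum: np.ndarray, visible_count: np.ndarray, k: int) -> np.ndarray:
    """Select the k Gaussians with the largest average positional gradient.

    Hypothetical Top-K rule: instead of comparing against a fixed gradient
    threshold, always densify a fixed budget of the k highest-gradient Gaussians.
    """
    # Average the accumulated gradient over the views in which each Gaussian was visible.
    avg_grad = grad_accum / np.maximum(visible_count, 1)
    n = avg_grad.shape[0]
    k = min(k, n)
    mask = np.zeros(n, dtype=bool)
    if k <= 0:
        return mask
    # argpartition on the negated values places the k largest gradients first (unordered).
    top_idx = np.argpartition(-avg_grad, k - 1)[:k]
    mask[top_idx] = True
    return mask


def clone_selected(means: np.ndarray, mask: np.ndarray) -> np.ndarray:
    """Append copies of the selected Gaussian centers (a simplified 'clone' step)."""
    return np.concatenate([means, means[mask]], axis=0)


if __name__ == "__main__":
    grads = np.array([0.5, 2.0, 0.1, 3.0, 1.0])
    mask = topk_densify_mask(grads, np.ones(5), k=2)
    print(mask)  # Gaussians 1 and 3 have the two largest gradients
    means = np.random.default_rng(0).normal(size=(5, 3))
    print(clone_selected(means, mask).shape)  # 5 originals + 2 clones
```

A fixed K bounds how many Gaussians are added per step. Compared with a pure threshold rule, this is one plausible way to keep memory growth predictable.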
Mesh Refinement and Texture Disentanglement:
- In the refinement stage, Vista3D converts the coarse geometry into a signed distance field (SDF) and refines both geometry and texture with FlexiCubes, a differentiable isosurface representation.
- It introduces a disentangled texture representation that splits the texture into two hash encodings, one for the front-facing view and one for the back view, to better capture the diversity of unseen views.
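The disentangled texture can be pictured as two texture fields blended according to how directly a surface point faces the input camera. In the sketch below, `front_tex` and `back_tex` are hypothetical stand-ins for the two hash encodings. The sigmoid weighting of the normal-to-camera cosine, with an assumed `sharpness` parameter, is also an illustrative assumption, because the summary does not say how Vista3D assigns points to the two encodings.

```python
import numpy as np
from typing import Callable

# Hypothetical texture field: maps (N, 3) surface points to (N, 3) RGB colors.
TextureFn = Callable[[np.ndarray], np.ndarray]


def blend_front_back(
    points: np.ndarray,
    normals: np.ndarray,
    cam_dir: np.ndarray,
    front_tex: TextureFn,
    back_tex: TextureFn,
    sharpness: float = 10.0,
) -> np.ndarray:
    """Blend front and back texture fields per point (hypothetical weighting)."""
    cam_dir = cam_dir / np.linalg.norm(cam_dir)
    # Cosine between each unit normal and the direction toward the input camera.
    facing = normals @ cam_dir
    # Soft weight: close to 1 for front-facing points, close to 0 for back-facing ones.
    w = 1.0 / (1.0 + np.exp(-sharpness * facing))
    return w[:, None] * front_tex(points) + (1.0 - w[:, None]) * back_tex(points)


if __name__ == "__main__":
    red = lambda p: np.tile([1.0, 0.0, 0.0], (len(p), 1))
    blue = lambda p: np.tile([0.0, 0.0, 1.0], (len(p), 1))
    normals = np.array([[0.0, 0.0, 1.0], [0.0, 0.0, -1.0], [1.0, 0.0, 0.0]])
    colors = blend_front_back(np.zeros((3, 3)), normals, np.array([0.0, 0.0, 1.0]), red, blue)
    print(colors.round(3))
```

The three example points come out almost pure red (facing the camera), almost pure blue (facing away), and an even mix (side-on). This shows how a soft weight avoids a hard seam between the two encodings.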
Angular-based Diffusion Prior Composition:
- To explore the diversity of the 3D "darkside" (the parts of the object not visible in the input image) while maintaining 3D consistency, Vista3D integrates two diffusion priors (Zero-1-to-3 XL and Stable Diffusion) and uses an angular-based composition method to constrain their gradient magnitudes.
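The summary does not give the composition formula, so the following is only a sketch of the general idea under stated assumptions. The Zero-1-to-3 XL gradient is treated as the primary, consistency-preserving signal, and the Stable Diffusion gradient as a secondary, diversity-adding signal. The angle between them decides whether the conflicting component of the secondary gradient is removed. This step is a PCGrad-style projection, swapped in here for illustration. The secondary magnitude is then capped at an assumed fraction `max_ratio` of the primary's norm. The function name and parameters are hypothetical.

```python
import numpy as np


def compose_gradients(
    g_primary: np.ndarray,
    g_secondary: np.ndarray,
    max_ratio: float = 0.5,
    eps: float = 1e-12,
) -> tuple[np.ndarray, float]:
    """Combine two flattened prior gradients using their angle (hypothetical rule).

    Returns the combined gradient and the cosine of the angle between the
    original primary and secondary gradients.
    """
    gp_norm = np.linalg.norm(g_primary)
    cos = float(g_primary @ g_secondary) / (gp_norm * np.linalg.norm(g_secondary) + eps)
    if cos < 0.0:
        # Obtuse angle: drop the part of the secondary gradient that opposes the primary.
        g_secondary = g_secondary - (g_primary @ g_secondary) / (gp_norm ** 2 + eps) * g_primary
    # Cap the secondary magnitude relative to the primary so it cannot dominate the update.
    gs_norm = np.linalg.norm(g_secondary)
    limit = max_ratio * gp_norm
    if gs_norm > limit:
        g_secondary = g_secondary * (limit / (gs_norm + eps))
    return g_primary + g_secondary, cos


if __name__ == "__main__":
    combined, cos = compose_gradients(np.array([1.0, 0.0]), np.array([-1.0, 2.0]))
    print(combined, cos)  # opposing x-component removed, remaining y-component capped
```

In this toy case the secondary gradient keeps only its orthogonal direction, and that direction is shrunk to half the primary's norm. Under these assumptions, the primary prior sets the direction of the update, and the secondary prior can only add bounded variation.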
The framework generates diverse and 3D-consistent objects from a single input image within 5 minutes (Vista3D-S) or 15 minutes (Vista3D-L). In the reported evaluations, Vista3D outperforms existing image-to-3D generation methods.
Statistics
Vista3D can generate 3D objects from a single input image within 5 minutes (Vista3D-S) or 15 minutes (Vista3D-L).
Vista3D achieves a CLIP-Similarity score of 0.831 for Vista3D-S and 0.868 for Vista3D-L on the RealFusion dataset, outperforming previous methods.
On the Google Scanned Object (GSO) dataset, Vista3D-L achieves state-of-the-art performance with a PSNR of 26.31, SSIM of 0.929, and LPIPS of 0.062.
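The sketch below shows how two of these metrics are commonly defined. CLIP similarity in image-to-3D evaluations is typically the cosine similarity between the CLIP image embedding of the input image and the embeddings of rendered views, averaged over views. PSNR is 10 * log10(MAX^2 / MSE), where MAX is the largest possible pixel value. The sketch assumes the embeddings were already produced by a CLIP image encoder (not shown) and that images are scaled to [0, 1]. It is not the paper's evaluation script. SSIM and LPIPS need windowed statistics or a pretrained network and are not reproduced.

```python
import numpy as np


def clip_similarity(ref_embedding: np.ndarray, view_embeddings: np.ndarray) -> float:
    """Mean cosine similarity between a reference embedding (D,) and view embeddings (V, D)."""
    ref = ref_embedding / np.linalg.norm(ref_embedding)
    views = view_embeddings / np.linalg.norm(view_embeddings, axis=1, keepdims=True)
    return float(np.mean(views @ ref))


def psnr(pred: np.ndarray, target: np.ndarray, max_val: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB between two images of equal shape."""
    mse = float(np.mean((pred.astype(np.float64) - target.astype(np.float64)) ** 2))
    if mse == 0.0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_val ** 2 / mse))


if __name__ == "__main__":
    ref = np.array([1.0, 0.0])
    views = np.array([[1.0, 0.0], [0.0, 1.0]])
    print(clip_similarity(ref, views))  # 0.5: one identical view, one orthogonal view
    a = np.zeros((4, 4, 3))
    b = np.full((4, 4, 3), 0.1)
    print(psnr(a, b))  # about 20 dB, since MSE = 0.01
```

Higher CLIP similarity, PSNR, and SSIM indicate closer agreement with the reference. LPIPS is a distance, so lower is better.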
Quotes
"Vista3D excels in efficiently generating diverse and consistent 3D objects from a single image within five minutes."
"Central to Vista3D is a dual-phase strategy: a coarse phase followed by a fine phase."
"We propose an angular composition approach for diffusion priors, constraining their gradient magnitudes to achieve diversity on the 3D darkside without sacrificing 3D consistency."