Key Idea
This paper introduces an efficient and flexible text-to-3D generation framework that leverages a novel volumetric representation to enable fine-grained control over object characteristics through textual cues.
Summary
The paper presents a two-stage framework for text-to-3D generation:
Volume Encoding Stage:
- The authors propose a lightweight network that can efficiently convert multi-view images into a feature volume representation, bypassing the expensive per-object optimization process required by previous methods.
- This efficient encoder can process 30 objects per second on a single GPU, allowing the authors to acquire 500K models within hours.
- The localized feature volume representation enables flexible interaction with text prompts at the fine-grained object part level (a minimal encoder sketch follows below).
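Below is a minimal sketch of how such a feed-forward volume encoder could be organized, assuming a PyTorch-style module. The backbone layers, channel sizes, volume resolution, and mean-pooling aggregation across views are illustrative assumptions, not the authors' exact architecture.

```python
import torch
import torch.nn as nn

class VolumeEncoder(nn.Module):
    """Illustrative feed-forward encoder: multi-view images -> feature volume.

    Assumed shapes (not from the paper): N views of 3x128x128 images are
    encoded independently, then mean-pooled into a feat_dim x D x D x D volume.
    """

    def __init__(self, feat_dim: int = 32, vol_res: int = 32):
        super().__init__()
        self.feat_dim = feat_dim
        self.vol_res = vol_res
        # Per-view 2D backbone (a stand-in for the paper's lightweight network).
        self.backbone = nn.Sequential(
            nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(vol_res),          # -> 128 x vol_res x vol_res
        )
        # Lift 2D features to a coarse 3D volume (illustrative choice:
        # predict feat_dim * vol_res channels and reshape along depth).
        self.lift = nn.Conv2d(128, feat_dim * vol_res, 1)

    def forward(self, views: torch.Tensor) -> torch.Tensor:
        # views: (B, N, 3, H, W) -- a batch of B objects with N views each.
        b, n, c, h, w = views.shape
        feats = self.backbone(views.reshape(b * n, c, h, w))
        vol = self.lift(feats).reshape(
            b, n, self.feat_dim, self.vol_res, self.vol_res, self.vol_res)
        # Aggregate across views; the paper's cross-view fusion is more involved.
        return vol.mean(dim=1)  # (B, feat_dim, D, D, D)

# Usage: encode 8 views of 4 objects into 32^3 feature volumes in one pass.
volumes = VolumeEncoder()(torch.randn(4, 8, 3, 128, 128))
print(volumes.shape)  # torch.Size([4, 32, 32, 32, 32])
```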
Diffusion Modeling Stage:
- The authors train a text-conditioned diffusion model on the acquired feature volumes using a 3D U-Net architecture.
- To address the challenges posed by high-dimensional feature volumes and inaccurate object captions in datasets, the authors develop several key designs:
- A new noise schedule that shifts towards larger noise to effectively corrupt information in high-dimensional spaces.
- A low-frequency noise strategy that introduces additional information corruption adjustable to the resolution (see the sketch after this list).
- A caption filtering method to remove noisy captions and improve the model's understanding of the text-3D relationship.
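As a sketch of the low-frequency noise design above, one plausible realization is to blend per-voxel Gaussian noise with coarse noise that is upsampled to the volume resolution, so the corruption no longer averages out over large patches. The mixing weight `alpha`, the coarse resolution `low_res`, and the function names are assumptions for illustration, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def low_freq_noise(shape, alpha: float = 0.5, low_res: int = 4) -> torch.Tensor:
    """Illustrative low-frequency noise: blend i.i.d. per-voxel Gaussian noise
    with coarse noise shared across neighbouring voxels after upsampling.

    shape: (B, C, D, H, W) target volume shape. alpha and low_res are
    assumed hyperparameters, not values from the paper.
    """
    b, c, d, h, w = shape
    fine = torch.randn(shape)                            # i.i.d. per-voxel noise
    coarse = torch.randn(b, c, low_res, low_res, low_res)
    coarse = F.interpolate(coarse, size=(d, h, w), mode="trilinear",
                           align_corners=False)          # spatially correlated part
    noise = (1 - alpha) * fine + alpha * coarse
    return noise / noise.std()                           # keep roughly unit variance

def corrupt(x0: torch.Tensor, gamma_t: float) -> torch.Tensor:
    """Standard forward-diffusion corruption x_t = sqrt(g)*x_0 + sqrt(1-g)*eps,
    but using the low-frequency noise above so information is still destroyed
    at high resolution."""
    eps = low_freq_noise(x0.shape)
    return gamma_t ** 0.5 * x0 + (1 - gamma_t) ** 0.5 * eps

# Usage: corrupt a batch of 32^3 feature volumes at a mid-range noise level.
x_t = corrupt(torch.randn(2, 32, 32, 32, 32), gamma_t=0.5)
```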
The proposed framework demonstrates promising results in producing diverse and recognizable 3D samples from text prompts, with superior control over object part characteristics compared to previous methods.
Statistics
The authors' volume encoder can process 30 objects per second on a single GPU.
The authors were able to acquire 500K 3D models within hours using their efficient encoder.
The authors' diffusion model was trained on a subset of 100K text-object pairs from the Objaverse dataset after filtering out low-quality captions.
Quotes
"The proposed volume encoder is highly efficient for two primary reasons. Firstly, it is capable of generating a high-quality 3D volume with 32 or fewer images once it is trained. This is a significant improvement over previous methods, which require more than 200 views for object reconstruction. Secondly, our volume encoder can encode an object in approximately 30 milliseconds using a single GPU."
"We theoretically analyze the root of this problem. Considering a local patch on the image consisting of M = w×h×c values, denoted as x0 = {xi
0}M
i=1. Without loss of generality, we assume that {xi
0}M
i=1 are sampled from Gaussian distribution N(0, 1). With common strategy, we add i.i.d. Gaussian noise {ϵi}M
i=1 ∼N(0, 1) to each value by xi
t = √γtxi
0+√1 −γtϵi to obtain the noised sample, where γt indicates the noise level at timestep t. Thus the expected mean L2 perturbation of the patch is E[1/M ∑M
i=0 xi
0 −xi
t]2 = 2/M (1 −√γt). As the resolution M increases, the i.i.d. noises added to each value collectively have a minimal impact on the patch's appearance, and the disturbance is reduced significantly."