The Idea-2-3D framework addresses the challenge of generating 3D content from high-level, abstract multimodal inputs called IDEAs. It integrates three LMM-based agents and several off-the-shelf tools to transform IDEAs into tangible 3D models.
The process begins with a prompt-generation LMM agent converting the multimodal IDEA into text prompts for 3D model generation. These prompts are then passed to Text-to-Image (T-2-I) and Image-to-3D (I-2-3D) models to produce draft 3D models.
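A minimal sketch of this draft-generation step is shown below. The callables `lmm_generate_prompts`, `text_to_image`, and `image_to_3d` are illustrative placeholders for the LMM agent and the off-the-shelf T-2-I and I-2-3D tools, not the paper's actual API.

```python
from typing import Any, Callable, List


def generate_drafts(
    idea: dict,                                  # multimodal IDEA: text, images, 3D references
    lmm_generate_prompts: Callable[[dict], List[str]],
    text_to_image: Callable[[str], Any],         # T-2-I tool: prompt -> image
    image_to_3d: Callable[[Any], Any],           # I-2-3D tool: image -> draft 3D model
) -> List[dict]:
    """Turn an IDEA into candidate (prompt, image, draft 3D model) triples."""
    drafts = []
    for prompt in lmm_generate_prompts(idea):    # prompt-generation agent
        image = text_to_image(prompt)            # intermediate 2D rendering
        model_3d = image_to_3d(image)            # lift the image into a draft 3D model
        drafts.append({"prompt": prompt, "image": image, "model_3d": model_3d})
    return drafts
```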
A selection LMM agent then picks the best draft 3D model based on its fidelity and relevance to the IDEA. If further refinement is needed, a feedback LMM agent generates textual feedback to guide enhancements, and the prompt-generation agent uses this feedback to create revised prompts for the next iteration.
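One refinement iteration might look like the following sketch, where `lmm_select_best` (selection agent) and `lmm_critique` (feedback agent) are assumed, hypothetical callables rather than names from the paper.

```python
from typing import Callable, List, Optional, Tuple


def refine_step(
    idea: dict,
    drafts: List[dict],
    lmm_select_best: Callable[[dict, List[dict]], int],
    lmm_critique: Callable[[dict, dict], Optional[str]],
) -> Tuple[dict, Optional[str]]:
    """Pick the draft closest to the IDEA and, if needed, produce textual feedback."""
    best = drafts[lmm_select_best(idea, drafts)]   # selection agent ranks the drafts
    feedback = lmm_critique(idea, best)            # feedback agent; None means "good enough"
    return best, feedback                          # feedback drives the next prompt revision
```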
The framework also includes a memory module that stores the feedback, selected draft 3D models, and corresponding text prompts from previous iterations, so the LMM agents can condition later iterations on what has already been tried rather than starting from scratch.
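The memory module can be pictured as a simple append-only log of iterations, as in the sketch below; the record fields and the serialization format are assumptions for illustration only.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class IterationRecord:
    prompt: str      # text prompt used in this iteration
    model_3d: Any    # selected draft 3D model
    feedback: str    # textual feedback produced for the next iteration


@dataclass
class Memory:
    records: List[IterationRecord] = field(default_factory=list)

    def add(self, prompt: str, model_3d: Any, feedback: str) -> None:
        self.records.append(IterationRecord(prompt, model_3d, feedback))

    def as_context(self) -> str:
        """Serialize past prompts and feedback so the agents can condition on them."""
        return "\n".join(
            f"Iteration {i}: prompt={r.prompt!r}; feedback={r.feedback!r}"
            for i, r in enumerate(self.records, 1)
        )
```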
Comprehensive user studies demonstrate the superiority of Idea-2-3D over caption-based baselines: users preferred Idea-2-3D models in 94.2% of cases and judged them to meet the IDEA requirements 2.3 times better.
Source: Junhao Chen et al., arxiv.org, 04-09-2024. https://arxiv.org/pdf/2404.04363.pdf