The paper introduces a "morphable diffusion" model that extends state-of-the-art multi-view consistent diffusion approaches to the task of creating controllable, photorealistic human avatars. The key idea is to integrate a 3D morphable model into the diffusion pipeline, which allows the generative process to be accurately conditioned on the articulated 3D model and brings facial expression and body pose control directly into generation.
The paper first analyzes how well existing multi-view consistent diffusion models transfer to human avatar creation and shows that simply fine-tuning them on limited datasets leads to sub-optimal results. To address this, the proposed morphable diffusion model leverages a well-studied statistical model of human shape (e.g., SMPL, FLAME) to introduce human priors that guide the reconstruction process.
The model takes a single input image and an underlying morphable model, and generates N multi-view consistent novel views. It does so by unprojecting and interpolating the noisy image features onto the morphable model's mesh vertices, processing them with a sparse 3D ConvNet, and then using a 2D UNet conditioned on the resulting 3D-aware feature volume to predict the denoised images.
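To make this pipeline concrete, here is a minimal PyTorch sketch of one denoising step. It is not the authors' code: the module names, tensor shapes, the single-convolution encoders, and the dense 3D CNN standing in for the sparse 3D ConvNet are all illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MorphableDenoiserSketch(nn.Module):
    """Illustrative skeleton of one denoising step: image features are
    gathered at the projected mesh vertices, fused in a coarse voxel grid
    (a dense 3D CNN stands in for the paper's sparse 3D ConvNet), and the
    resulting 3D-aware volume conditions a 2D denoiser (a UNet in the paper)."""

    def __init__(self, feat_dim=64, vol_res=32):
        super().__init__()
        self.vol_res = vol_res
        self.img_encoder = nn.Conv2d(3, feat_dim, 3, padding=1)
        self.vol_net = nn.Sequential(              # stand-in for the sparse 3D ConvNet
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1), nn.ReLU(),
            nn.Conv3d(feat_dim, feat_dim, 3, padding=1),
        )
        self.denoiser = nn.Conv2d(3 + feat_dim, 3, 3, padding=1)  # stand-in for the 2D UNet

    def forward(self, noisy_views, vertices, vertex_uv):
        # noisy_views: (N, 3, H, W) noisy target views at the current timestep
        # vertices:    (V, 3) morphable-model vertices in normalized coords [-1, 1]
        # vertex_uv:   (N, V, 2) projection of each vertex into each view, in [-1, 1]
        N, _, H, W = noisy_views.shape
        feats = self.img_encoder(noisy_views)                        # (N, C, H, W)

        # 1) Unproject: sample per-view features at the projected vertex locations
        grid = vertex_uv.unsqueeze(2)                                # (N, V, 1, 2)
        vert_feats = F.grid_sample(feats, grid, align_corners=True)  # (N, C, V, 1)
        vert_feats = vert_feats.squeeze(-1).mean(dim=0)              # (C, V), averaged over views

        # 2) Scatter vertex features into a coarse voxel grid anchored on the mesh
        #    (colliding vertices simply overwrite each other in this sketch)
        C = vert_feats.shape[0]
        vol = feats.new_zeros((C, self.vol_res, self.vol_res, self.vol_res))
        idx = ((vertices * 0.5 + 0.5) * (self.vol_res - 1)).long().clamp(0, self.vol_res - 1)
        vol[:, idx[:, 0], idx[:, 1], idx[:, 2]] = vert_feats
        vol = self.vol_net(vol.unsqueeze(0))                         # 3D-aware feature volume

        # 3) Condition the 2D denoiser on the volume features
        #    (collapsed to a global vector here purely for brevity)
        cond = vol.mean(dim=(2, 3, 4)).view(1, -1, 1, 1).expand(N, -1, H, W)
        return self.denoiser(torch.cat([noisy_views, cond], dim=1))  # predicted denoised views
```

Step 3 collapses the volume to a single global vector only to keep the sketch short; in the actual method the 3D-aware feature volume provides spatially varying, per-view conditioning for the UNet.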
The paper further proposes an efficient training scheme that disentangles reconstruction (guided by the input image) from animation (guided by the morphable model), enabling the generation of new facial expressions for an unseen subject from a single image.
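As one way to illustrate this disentanglement at the data level, the hypothetical sampler below draws the conditioning image and the supervised target frame independently, so the morphable model (expression) and the input image (identity and appearance) need not come from the same frame. The function, field names, and probability parameter are assumptions for illustration, not the paper's exact procedure.

```python
import random

def sample_training_pair(sequence, p_animate=0.5):
    """Hypothetical sampler: the clean input image and the supervised target
    views may come from different frames (expressions) of the same subject,
    so the model learns to animate rather than only reconstruct.
    `sequence` is assumed to be a list of frames, each holding multi-view
    images and a fitted morphable-model mesh."""
    tgt = random.choice(sequence)          # frame providing target views + mesh
    if random.random() < p_animate:
        src = random.choice(sequence)      # input image from a (possibly) different expression
    else:
        src = tgt                          # pure reconstruction: same frame
    return {
        "input_image": src["views"][0],    # single conditioning image
        "target_views": tgt["views"],      # N views to be denoised
        "mesh_vertices": tgt["mesh"],      # morphable model driving the target expression
    }
```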
Extensive quantitative and qualitative evaluations demonstrate the advantages of the proposed morphable diffusion model over existing state-of-the-art avatar creation methods on both novel view and novel expression synthesis tasks.
Key insights extracted from: Xiyi Chen, Ma... (arxiv.org, 04-03-2024), https://arxiv.org/pdf/2401.04728.pdf