The paper proposes SiTH, a two-stage pipeline for single-view textured human reconstruction.
In the first stage, SiTH employs an image-conditioned diffusion model to hallucinate perceptually consistent back-view appearances based on the input front-view image. This is achieved by adapting the network architecture of a pretrained diffusion model to enable conditioning on the front-view image, UV map, and silhouette mask.
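This conditioning scheme can be illustrated with a short sketch. The PyTorch example below is a minimal, hypothetical illustration, not the authors' implementation: it assumes the front-view image, UV map, and silhouette mask are concatenated channel-wise with the noisy back-view sample before entering the denoiser, and `TinyUNet` is a stand-in for the pretrained diffusion backbone.

```python
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    """Stand-in for the pretrained denoising backbone (illustration only)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, 64, 3, padding=1), nn.SiLU(),
            nn.Conv2d(64, out_ch, 3, padding=1),
        )
    def forward(self, x):
        return self.net(x)

class BackViewDenoiser(nn.Module):
    """Predicts the noise residual for a back-view sample, conditioned on
    front-view RGB (3 ch), a body UV map (3 ch, assumed encoding), and a
    silhouette mask (1 ch), all concatenated channel-wise."""
    def __init__(self, img_ch=3, cond_ch=7):
        super().__init__()
        # The first layer is widened so the extra channels carry the conditions.
        self.unet = TinyUNet(img_ch + cond_ch, img_ch)

    def forward(self, noisy_back, front_rgb, uv_map, sil_mask):
        cond = torch.cat([front_rgb, uv_map, sil_mask], dim=1)
        return self.unet(torch.cat([noisy_back, cond], dim=1))

# One denoising step on random tensors (shape check only, no pretrained weights).
model = BackViewDenoiser()
noisy = torch.randn(1, 3, 64, 64)   # noisy back-view sample
front = torch.rand(1, 3, 64, 64)    # input front-view image
uv    = torch.rand(1, 3, 64, 64)    # UV map rendered from the fitted body mesh
mask  = torch.rand(1, 1, 64, 64)    # silhouette mask
eps_pred = model(noisy, front, uv, mask)  # -> (1, 3, 64, 64)
```

In practice, such adaptations typically zero-initialize the new conditioning pathway so the pretrained weights remain effective at the start of fine-tuning; the sketch omits this detail.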
In the second stage, SiTH leverages the input front-view image and the generated back-view image to reconstruct a full-body textured mesh. It uses a skinned body prior to resolve the depth ambiguity inherent in single-view reconstruction and employs pixel-aligned feature querying to learn a mapping from the two images to 3D geometry and texture.
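Pixel-aligned querying follows the pattern popularized by PIFu-style methods: each 3D query point is projected onto the front and back images, image features are bilinearly sampled at the projected locations, and an MLP decodes occupancy and color. The sketch below is a hypothetical simplification under assumed conventions (orthographic projection with x/y in [-1, 1], back view mirrored along x); `PixelAlignedField` and its layer sizes are illustrative, and the paper's body-prior guidance is omitted.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class PixelAlignedField(nn.Module):
    """Maps 3D query points to occupancy and RGB via pixel-aligned features."""
    def __init__(self, feat_ch=32):
        super().__init__()
        # Stand-in image encoder; a real model would use a deeper feature network.
        self.encoder = nn.Conv2d(3, feat_ch, 3, padding=1)
        # MLP input: front features + back features + query depth.
        self.mlp = nn.Sequential(
            nn.Linear(2 * feat_ch + 1, 128), nn.ReLU(),
            nn.Linear(128, 1 + 3),  # occupancy logit + RGB
        )

    def forward(self, front_img, back_img, points):
        # points: (B, N, 3), x/y in [-1, 1] under an assumed orthographic camera.
        f_front = self.encoder(front_img)
        f_back = self.encoder(back_img)
        grid = points[..., :2].unsqueeze(2)  # (B, N, 1, 2) sampling grid
        # Bilinearly sample features at each projected point.
        s_front = F.grid_sample(f_front, grid,
                                align_corners=True).squeeze(-1).transpose(1, 2)
        # Assumed convention: the back view is mirrored along x.
        s_back = F.grid_sample(f_back, grid * torch.tensor([-1.0, 1.0]),
                               align_corners=True).squeeze(-1).transpose(1, 2)
        z = points[..., 2:3]  # depth disambiguates points along the same ray
        out = self.mlp(torch.cat([s_front, s_back, z], dim=-1))
        return torch.sigmoid(out[..., :1]), torch.sigmoid(out[..., 1:])

# Query 4096 random points against random images (shape check only).
field = PixelAlignedField()
front = torch.rand(1, 3, 128, 128)
back  = torch.rand(1, 3, 128, 128)
pts   = torch.rand(1, 4096, 3) * 2 - 1
occ, rgb = field(front, back, pts)  # -> (1, 4096, 1), (1, 4096, 3)
```

A mesh would then be extracted from the occupancy field (e.g., via marching cubes) and colored by the RGB branch.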
The authors demonstrate that SiTH can efficiently produce high-quality and diverse 3D textured humans from single images, including in-the-wild photos and AI-generated images. Extensive evaluations on two 3D human benchmarks show that SiTH outperforms state-of-the-art methods in both accuracy and perceptual quality.