Core Concepts
InstructHumans: a framework for instruction-driven editing of animatable 3D human avatars that significantly outperforms existing 3D editing methods in producing high-quality, consistent, and faithful edits.
Abstract
The paper presents InstructHumans, a framework for instruction-driven editing of animatable 3D human avatars. Existing text-based editing methods use Score Distillation Sampling (SDS) to distill guidance from generative models, but the authors show that naively applying such scores is harmful to editing because it destroys consistency with the source avatar.
The authors propose an alternate SDS for Editing (SDS-E) that selectively incorporates subterms of SDS across diffusion timesteps. They further enhance SDS-E with spatial smoothness regularization and gradient-based viewpoint sampling to achieve high-quality edits with sharp and high-fidelity detailing.
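The timestep-selective idea behind SDS-E can be illustrated with a minimal sketch. This is a hypothetical illustration, not the authors' implementation: the decomposition into a guidance term and a denoising-bias term, the threshold `t_threshold`, and the function name are all assumptions made for exposition; the paper's actual subterm analysis and scheduling may differ.

```python
import numpy as np

def sds_e_gradient(eps_pred_cond, eps_pred_uncond, eps_true, t, w,
                   t_threshold=0.4):
    """Hypothetical sketch of timestep-selective score distillation.

    Plain SDS would use w * (eps_pred_cond - eps_true) at every timestep.
    Here the score is split into two subterms, and the generic denoising
    term is dropped at low-noise timesteps (an illustrative choice meant
    to preserve consistency with the source avatar).
    """
    # Instruction-driven direction (classifier-free-guidance-style term).
    guidance_term = eps_pred_cond - eps_pred_uncond
    # Generic denoising bias toward the model's prior, unrelated to the edit.
    denoise_term = eps_pred_uncond - eps_true

    if t >= t_threshold:
        # High-noise steps: both terms shape coarse structure.
        return w * (guidance_term + denoise_term)
    # Low-noise steps: keep only the edit direction to avoid drifting
    # away from the source avatar's fine details.
    return w * guidance_term
```

The key design point is that the same SDS score is not applied uniformly: which subterms contribute depends on where the timestep falls in the diffusion schedule.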
InstructHumans significantly outperforms existing 3D editing methods, producing edits that are consistent with the initial avatar while faithful to the textual instructions. The framework can generate animatable 3D human avatars that can be driven by arbitrary SMPL-X poses.
The authors provide a detailed analysis of the different SDS terms and their impact at different timesteps, leading to the design of SDS-E. They also conduct qualitative and quantitative evaluations, including a user study, demonstrating the superiority of their approach over state-of-the-art methods.
Stats
No key metrics or figures were extracted from the paper to support the authors' main claims.
Quotes
No striking quotes were extracted from the paper to support the authors' main claims.