The paper introduces ControlNet++, a novel approach to enhance the controllability of text-to-image diffusion models. Existing methods still struggle to generate images that accurately align with the given image-based conditional controls.
To address this, the authors propose to model controllable generation as an image translation task, using pre-trained discriminative reward models to extract the corresponding condition from the generated images and then optimizing a consistency loss between the input and extracted conditions. This explicit pixel-level cycle consistency optimization differs from existing methods, which achieve controllability only implicitly through the latent-space denoising process.
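The core idea can be sketched in a few lines of PyTorch. Everything below is an illustrative assumption rather than the authors' implementation: the tiny segmentation head standing in for the reward model, the tensor shapes, and the choice of cross-entropy for mask-style conditions are all placeholders. The sketch only shows the mechanism: a frozen discriminative model re-extracts the condition from a generated image, and the distance to the input condition becomes the training signal.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinySegReward(nn.Module):
    """Toy stand-in for a frozen, pre-trained discriminative reward model
    (e.g. a segmentation network); purely illustrative."""
    def __init__(self, num_classes: int = 19):
        super().__init__()
        self.head = nn.Conv2d(3, num_classes, kernel_size=3, padding=1)

    def forward(self, x):
        return self.head(x)  # per-pixel class logits

def cycle_consistency_loss(reward_model, x_gen, c_in):
    """Pixel-level cycle consistency: re-extract the condition from the
    generated image and compare it with the input condition."""
    c_pred = reward_model(x_gen)          # predicted condition (logits)
    # Mask-style conditions -> cross-entropy; edges or depth maps would use L1/L2.
    return F.cross_entropy(c_pred, c_in)

reward = TinySegReward().eval()
for p in reward.parameters():             # the reward model stays frozen
    p.requires_grad_(False)

x_gen = torch.rand(2, 3, 64, 64, requires_grad=True)   # stands in for generated images
c_in = torch.randint(0, 19, (2, 64, 64))                # input segmentation masks
loss = cycle_consistency_loss(reward, x_gen, c_in)
loss.backward()                           # gradient flows back into the generator
```

Because the reward model is frozen, the gradient of this loss flows only into the generative side, pushing it to produce images whose extracted conditions match the given inputs.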
The authors also introduce an efficient reward fine-tuning strategy that deliberately disturbs the input images by adding noise and then uses the single-step denoised images for reward fine-tuning. This avoids the extensive time and memory costs of the multiple sampling steps that a straightforward implementation would require.
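A minimal sketch of that single-step strategy is shown below, assuming an epsilon-prediction UNet and a depth-style condition scored with an L1 distance; the function names, shapes, and placeholder modules are hypothetical, not the released code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def single_step_reward_loss(unet, reward_model, alphas_cumprod, x0, c_in, t):
    """Noise the real training image, denoise it with ONE UNet call, and score
    the single-step prediction with the reward model (illustrative sketch)."""
    noise = torch.randn_like(x0)
    a_t = alphas_cumprod[t].view(-1, 1, 1, 1)
    x_t = a_t.sqrt() * x0 + (1.0 - a_t).sqrt() * noise            # forward diffusion
    eps_pred = unet(x_t, t, c_in)                                  # single denoising step
    x0_pred = (x_t - (1.0 - a_t).sqrt() * eps_pred) / a_t.sqrt()   # one-step x0 estimate
    c_pred = reward_model(x0_pred)                                 # re-extracted condition
    return F.l1_loss(c_pred, c_in)        # depth-style condition -> L1 consistency

# Toy usage with stand-in components (not the actual models):
alphas_cumprod = torch.linspace(0.9999, 0.01, 1000)
unet = lambda x_t, t, c: torch.zeros_like(x_t)             # placeholder epsilon predictor
reward_model = nn.Conv2d(3, 1, kernel_size=3, padding=1)   # placeholder depth estimator
x0 = torch.rand(2, 3, 64, 64)                              # real images from the dataset
c_in = torch.rand(2, 1, 64, 64)                            # their input depth maps
t = torch.randint(0, 1000, (2,))
loss = single_step_reward_loss(unet, reward_model, alphas_cumprod, x0, c_in, t)
```

Because only one UNet call separates the noised input from the image fed to the reward model, the reward gradient is backpropagated through a single denoising step rather than an entire sampling trajectory, which is what keeps the time and memory costs manageable.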
Extensive experiments show that ControlNet++ significantly outperforms existing state-of-the-art methods in terms of controllability under various conditional controls, such as segmentation masks, line-art edges, and depth maps, without compromising image quality. The authors also demonstrate that the generated images from ControlNet++ can effectively boost the performance of downstream discriminative models trained on the generated data.