The paper introduces upsample guidance, a novel technique that allows diffusion models to generate high-resolution images without requiring additional training or external models. The key insights are:
Diffusion models have difficulty directly generating high-resolution samples, and previous solutions involve modifying the architecture, further training, or using multiple stages.
Upsample guidance addresses this issue by matching the signal-to-noise ratio (SNR) between the trained low-resolution model and the target high resolution, and by adding a guidance term to the predicted noise. This lets a pre-trained low-resolution model generate high-resolution images with no retraining.
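The SNR-matching step can be illustrated with a minimal sketch. Averaging an m x m block of pixels leaves the signal mean intact but shrinks the noise standard deviation by a factor of m, so the downsampled sample has m^2 times the SNR of its timestep; the trick is to query the low-resolution model at the timestep whose native SNR matches. The schedule and function names below are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def snr(alpha_bar):
    # SNR of a diffusion state with signal scale sqrt(alpha_bar)
    # and noise scale sqrt(1 - alpha_bar).
    return alpha_bar / (1.0 - alpha_bar)

def matched_time(alpha_bars, t, m):
    # m x m average pooling shrinks the noise std by a factor of m,
    # so the pooled sample's SNR is m**2 times larger. Return the
    # timestep whose native SNR is closest to that target.
    target = m**2 * snr(alpha_bars[t])
    return int(np.argmin(np.abs(snr(alpha_bars) - target)))

# Hypothetical cosine-like noise schedule over 1000 steps.
T = 1000
alpha_bars = np.clip(np.cos(0.5 * np.pi * np.arange(T) / T) ** 2,
                     1e-5, 1 - 1e-5)

# With 2x pooling, the matched timestep is earlier (less noisy).
t_prime = matched_time(alpha_bars, t=500, m=2)
```

Because the pooled sample is effectively cleaner, the matched timestep always lands earlier in the schedule than the current one.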
The method is universally applicable to various types of diffusion models, including pixel-space, latent-space, and video diffusion models. It is also compatible with other techniques that improve or control diffusion models, such as SDEdit, ControlNet, LoRA, and IP-Adapter.
Experiments demonstrate that upsample guidance can effectively resolve artifacts and improve image quality, fidelity, and prompt alignment across different models, resolutions, and conditional generation methods.
The technique is computationally efficient, with only a small additional cost compared to the original sampling process.
For latent diffusion models, the authors propose a time-dependent guidance scale to prevent artifacts introduced by the encoder-decoder structure.
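One way such a time-dependent scale could look: apply full guidance during the noisy early steps and fade it out near the end of sampling, where the encoder-decoder artifacts tend to surface. The linear ramp and the `cutoff` parameter below are illustrative assumptions, not the paper's derived schedule:

```python
def guidance_scale(t, T, w_max=1.0, cutoff=0.3):
    # Hypothetical time-dependent scale: constant w_max while the
    # remaining-noise fraction t/T is above `cutoff`, then a linear
    # fade to zero over the final steps of sampling.
    frac = t / T
    if frac >= cutoff:
        return w_max
    return w_max * frac / cutoff

# Full guidance early, zero guidance at the final step.
scales = [guidance_scale(t, 1000) for t in (1000, 500, 150, 0)]
```

The fade-out keeps the extra guidance term from fighting the latent decoder exactly when fine detail is being finalized.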
The paper also explores spatial and temporal upsampling in video generation models, showing the versatility of the method.