Applying classifier-free guidance only within a specific interval of noise levels during sampling significantly improves the quality of diffusion-model-generated images and reduces sampling cost, outperforming guidance applied across the entire noise range.
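A minimal sketch of the idea, assuming a hypothetical denoiser interface `eps_model(x, t, cond)` and a normalized noise level `t`: guidance is applied only when `t` falls inside the interval, and the conditional prediction is used alone elsewhere (which also skips the unconditional forward pass).

```python
import torch

def guided_eps(eps_model, x, t, cond, guidance_scale=5.0,
               t_lo=0.3, t_hi=0.8):
    """Classifier-free guidance restricted to a noise-level interval.

    Hypothetical interface: eps_model(x, t, cond) predicts noise, and
    cond=None yields the unconditional prediction. t is a normalized
    noise level in [0, 1]; [t_lo, t_hi] is the guidance interval.
    """
    eps_cond = eps_model(x, t, cond)
    if t_lo <= t <= t_hi:
        # Inside the interval: standard CFG extrapolation.
        eps_uncond = eps_model(x, t, None)
        return eps_uncond + guidance_scale * (eps_cond - eps_uncond)
    # Outside the interval: no guidance, and no unconditional pass needed.
    return eps_cond
```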
A novel curriculum learning approach based on progressive object-level blurring significantly improves the performance and stability of layout-to-image generation models.
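A minimal sketch of such a curriculum, under assumed inputs (a `(C, H, W)` image tensor and per-object boolean masks; the names are hypothetical): blur strength on object regions decays as training progresses.

```python
import torch
import torchvision.transforms.functional as TF

def curriculum_blur(image, object_masks, epoch, total_epochs, max_sigma=8.0):
    """Blur object regions with strength that decays over training.

    Assumed setup: image is (C, H, W), object_masks is a list of (H, W)
    boolean tensors, one per layout object. Early epochs see heavily
    blurred objects; later epochs see (nearly) sharp ones.
    """
    sigma = max_sigma * (1.0 - epoch / total_epochs)
    if sigma <= 0.1:
        return image  # end of curriculum: train on sharp objects
    kernel = int(2 * round(3 * sigma) + 1)  # odd kernel covering ~3 sigma
    blurred = TF.gaussian_blur(image, kernel_size=kernel, sigma=sigma)
    out = image.clone()
    for m in object_masks:
        out = torch.where(m.unsqueeze(0), blurred, out)
    return out
```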
Generated images exhibit distinct properties in representation space compared to real images, specifically in terms of complexity and vulnerability. The authors propose two novel metrics, anomaly score (AS) and anomaly score for individual images (AS-i), that exploit these properties to evaluate generative models and individual generated images.
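One way to picture the "vulnerability" property (a loose illustration under assumed definitions, not the paper's exact metric; `feat` is a hypothetical feature extractor) is to measure how far a representation moves under a small gradient-based perturbation of the input:

```python
import torch

def vulnerability(x, feat, eps=1e-2):
    """Loose illustration: how far the representation of input x moves
    under a one-step, FGSM-style gradient perturbation of the input."""
    x = x.detach().requires_grad_(True)
    f0 = feat(x)
    # Perturb the input in the direction that most changes the features.
    (g,) = torch.autograd.grad(f0.norm(), x)
    x_adv = x + eps * g.sign()
    with torch.no_grad():
        return (feat(x_adv) - f0.detach()).norm().item()
```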
The proposed Energy-Calibrated Variational Autoencoder (EC-VAE) utilizes a conditional Energy-Based Model (EBM) to calibrate the generative direction of a Variational Autoencoder (VAE) during training, enabling it to generate high-quality samples without requiring expensive Markov Chain Monte Carlo (MCMC) sampling at test time. The energy-based calibration can also be extended to enhance variational learning and normalizing flows, and applied to zero-shot image restoration tasks.
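The training-time calibration can be pictured as short-run Langevin dynamics nudging decoder samples toward lower energy; a minimal sketch, assuming a hypothetical `energy` network (this is not the paper's full objective):

```python
import torch

def langevin_calibrate(x, energy, steps=10, step_size=0.01, noise_std=0.005):
    """Refine decoder samples x with short-run Langevin dynamics on an EBM.

    Training-time only: the VAE is then pushed toward the calibrated
    samples, so no MCMC is needed when sampling at test time.
    """
    x = x.detach().requires_grad_(True)
    for _ in range(steps):
        e = energy(x).sum()
        (grad,) = torch.autograd.grad(e, x)
        # Gradient descent on the energy plus Gaussian exploration noise.
        x = x - step_size * grad + noise_std * torch.randn_like(x)
        x = x.detach().requires_grad_(True)
    return x.detach()
```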
This paper proposes BinaryDM, a novel quantization-aware training approach that pushes diffusion model weights toward the 1-bit limit, achieving significant accuracy and efficiency gains over state-of-the-art quantization methods at ultra-low bit-widths.
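For intuition, a common building block of weight binarization (shown here in a generic XNOR-Net style, not necessarily BinaryDM's exact scheme) is sign quantization with a per-channel scale, trained via a straight-through gradient estimator:

```python
import torch

class BinarizeWeight(torch.autograd.Function):
    """Sign binarization with a straight-through gradient estimator.

    Usage during the forward pass: w_bin = BinarizeWeight.apply(weight)
    """

    @staticmethod
    def forward(ctx, w):
        ctx.save_for_backward(w)
        # Per-output-channel scale preserves average weight magnitude.
        alpha = w.abs().mean(dim=list(range(1, w.dim())), keepdim=True)
        return alpha * torch.sign(w)

    @staticmethod
    def backward(ctx, grad_out):
        (w,) = ctx.saved_tensors
        # Straight-through: pass gradients where |w| <= 1, clip elsewhere.
        return grad_out * (w.abs() <= 1).to(grad_out.dtype)
```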
Introducing an additional mask prompt to better model the relationship between foreground and background enables the diffusion model to generate higher-quality, more controllable images with greater fidelity to the reference image.
Diffusion-RWKV is a novel architecture that adapts the RWKV model for efficient and scalable image generation, achieving comparable performance to Transformer-based diffusion models while significantly reducing computational complexity.
This paper introduces a domain-general framework for many-to-many image generation that produces interrelated image series from a given set of images, offering a scalable approach that obviates task-specific solutions across different multi-image scenarios.
Visual Autoregressive (VAR) modeling redefines autoregressive learning on images as a coarse-to-fine "next-scale prediction" strategy, which lets autoregressive transformers learn visual distributions quickly and generalize well, surpassing diffusion models in image synthesis.
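Schematically, each decoding step emits an entire token map at the next, finer scale, conditioned on all coarser maps; a sketch with a hypothetical `transformer` interface (not the authors' exact API):

```python
import torch

def next_scale_decode(transformer, start_token, scales=(1, 2, 4, 8, 16)):
    """Coarse-to-fine decoding: predict an entire token map per step.

    Hypothetical interface: transformer(context, target_hw) returns logits
    of shape (batch, h * w, vocab) for the next scale, given the flattened
    context of all previously generated token maps.
    """
    context = start_token  # (batch, 1) start-of-sequence token
    token_maps = []
    for s in scales:
        logits = transformer(context, target_hw=(s, s))
        # All tokens of one scale are sampled in parallel, unlike
        # next-token AR models that emit a single token at a time.
        tokens = torch.distributions.Categorical(logits=logits).sample()
        token_maps.append(tokens.view(-1, s, s))
        context = torch.cat([context, tokens], dim=1)
    return token_maps  # multi-scale token maps for the VQ decoder
```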
This work proposes a CLIP-based framework, OCC-CLIP, to determine if a given image was generated by the same model as a set of few-shot examples, even when the target model cannot be accessed.
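As a rough stand-in for the one-class setup (simplified for illustration; the file names are placeholders and this is not OCC-CLIP's actual training procedure), one can embed the few-shot examples with CLIP and score a query by cosine similarity to their mean embedding:

```python
import torch
import clip  # OpenAI CLIP: pip install git+https://github.com/openai/CLIP.git
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

@torch.no_grad()
def embed(paths):
    ims = torch.stack([preprocess(Image.open(p)) for p in paths]).to(device)
    feats = model.encode_image(ims)
    return feats / feats.norm(dim=-1, keepdim=True)

# Few-shot examples known to come from the target generator (placeholders).
ref = embed(["gen_example_1.png", "gen_example_2.png"]).mean(dim=0)
ref = ref / ref.norm()

# Higher cosine similarity -> more likely from the same source model.
score = (embed(["query.png"]) @ ref).item()
print(f"same-source score: {score:.3f}")
```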