Scaling Diffusion-based Text-to-Image Generation: Insights from Extensive Ablations on Denoising Backbones and Training Data
This empirical study examines how diffusion-based text-to-image generation models scale, using extensive ablations on both the denoising backbone and the training dataset. Three findings stand out: the design of the denoising backbone matters; UNet and Transformer models can each be scaled efficiently; and scaling the dataset and enhancing its captions both have a significant impact on model performance.