The paper investigates the potential of combining the U-Net architecture with Diffusion Transformers (DiTs) for latent-space image generation. The authors first conduct a toy experiment comparing a U-Net-style DiT (DiT-UNet) with an isotropic DiT, finding that the U-Net inductive bias is not fully exploited when U-Nets and transformer blocks are naively combined.
Motivated by the observation that U-Net backbone features are dominated by low-frequency components, the authors propose downsampling the tokens fed to self-attention in the U-Net-style DiT. This simple yet effective modification yields the U-DiT models, which bring significant performance improvements despite a considerable reduction in computation cost.
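To make the core idea concrete, here is a minimal PyTorch sketch of self-attention over downsampled tokens. The module name, the 2x2 average pooling, and the nearest-neighbor upsampling are illustrative assumptions, not the paper's exact implementation:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DownsampledSelfAttention(nn.Module):
    """Self-attention over spatially downsampled tokens (illustrative sketch).

    Tokens are average-pooled 2x2 before attention, which shrinks the
    quadratic attention cost, then upsampled back so the output token
    count matches the input. A hypothetical simplification of the token
    downsampling described in the paper.
    """

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # x: (B, N, C) with N = h * w latent tokens; h and w assumed even.
        b, n, c = x.shape
        grid = x.transpose(1, 2).reshape(b, c, h, w)          # tokens -> 2D grid
        down = F.avg_pool2d(grid, kernel_size=2)              # (B, C, h/2, w/2)
        tokens = down.flatten(2).transpose(1, 2)              # (B, N/4, C)
        out, _ = self.attn(tokens, tokens, tokens)            # attention on 4x fewer tokens
        out = out.transpose(1, 2).reshape(b, c, h // 2, w // 2)
        up = F.interpolate(out, size=(h, w), mode="nearest")  # restore resolution
        return up.flatten(2).transpose(1, 2)                  # back to (B, N, C)
```

For a 32x32 latent grid, attention then runs over 256 tokens instead of 1,024, roughly a 16x reduction in the quadratic attention term.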
The authors then scale up the U-DiT models and conduct extensive experiments, demonstrating the strong performance and scalability of U-DiTs against DiTs and their improved variants. Notably, U-DiT-B outperforms the much larger DiT-XL/2 at only about 1/6 of its computation cost, and U-DiTs consistently outperform their isotropic counterparts across training iterations.
The key innovations of the U-DiT models include:

- A U-Net-style encoder-decoder backbone for latent-space diffusion, replacing the isotropic block stack of standard DiTs.
- Downsampled tokens for self-attention, motivated by the low-frequency dominance of U-Net backbone features, which reduces attention cost while improving generation quality (see the wiring sketch below).
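To show where the downsampled-attention module sits in the U-shaped backbone, the sketch below wires one encoder stage, a bottleneck, and one decoder stage with an additive skip connection, reusing `DownsampledSelfAttention` from above. The stage layout, helper names, and convolutional down/upsamplers are assumptions for illustration, not the authors' exact architecture:

```python
def to_grid(x: torch.Tensor, h: int, w: int) -> torch.Tensor:
    # (B, N, C) token sequence -> (B, C, h, w) feature map
    b, n, c = x.shape
    return x.transpose(1, 2).reshape(b, c, h, w)

def to_tokens(g: torch.Tensor) -> torch.Tensor:
    # (B, C, h, w) feature map -> (B, N, C) token sequence
    return g.flatten(2).transpose(1, 2)

class UDiTSketch(nn.Module):
    """U-Net-shaped stack of downsampled-attention stages (assumed layout).

    One encoder stage, a bottleneck, and one decoder stage with an additive
    skip connection; real U-DiT models stack several such stages and include
    the usual DiT conditioning, omitted here for brevity.
    """

    def __init__(self, dim: int = 256):
        super().__init__()
        self.enc = DownsampledSelfAttention(dim)
        self.down = nn.Conv2d(dim, dim, kernel_size=2, stride=2)         # halve the grid
        self.mid = DownsampledSelfAttention(dim)
        self.up = nn.ConvTranspose2d(dim, dim, kernel_size=2, stride=2)  # restore it
        self.dec = DownsampledSelfAttention(dim)

    def forward(self, x: torch.Tensor, h: int, w: int) -> torch.Tensor:
        # h, w assumed divisible by 4 (each attention stage pools by 2 internally).
        skip = x + self.enc(x, h, w)                      # encoder stage + residual
        mid = to_tokens(self.down(to_grid(skip, h, w)))   # shrink to (h/2, w/2)
        mid = mid + self.mid(mid, h // 2, w // 2)         # bottleneck stage
        dec = to_tokens(self.up(to_grid(mid, h // 2, w // 2))) + skip  # U-Net skip
        return dec + self.dec(dec, h, w)                  # decoder stage + residual

x = torch.randn(2, 32 * 32, 256)    # two samples of a 32x32 latent grid
y = UDiTSketch(dim=256)(x, 32, 32)  # output shape matches input: (2, 1024, 256)
```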
The paper demonstrates the effectiveness of the proposed U-DiT models and highlights the potential of combining the U-Net architecture with Diffusion Transformers for efficient and high-quality latent-space image generation.