The paper proposes DiffScaler, a method to efficiently scale pre-trained diffusion transformer models to perform diverse image generation tasks. The key insights are:
DiffScaler introduces a lightweight "Affiner" block that can be plugged into each trainable layer of the diffusion model. The Affiner block learns task-specific scaling and shifting of the weights, as well as additional task-specific subspaces, allowing the model to adapt to new datasets and conditions with minimal additional parameters.
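The mechanism described above can be sketched as a small adapter around a frozen linear layer. This is a minimal illustration, not the paper's exact parameterization: the names `AffinerLinear`, `scale`, `shift`, `down`, and `up` are assumptions, and the low-rank subspace is written LoRA-style for concreteness.

```python
import torch
import torch.nn as nn

class AffinerLinear(nn.Module):
    """Hypothetical sketch of an Affiner-style adapter (illustrative only).

    The pre-trained weight stays frozen; the task learns a per-channel
    scale and shift of the layer's output, plus an additional low-rank
    task-specific subspace.
    """

    def __init__(self, base: nn.Linear, rank: int = 4):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False  # pre-trained weights are not updated
        out_f, in_f = base.weight.shape
        # Task-specific scaling and shifting of the frozen layer's output.
        self.scale = nn.Parameter(torch.ones(out_f))
        self.shift = nn.Parameter(torch.zeros(out_f))
        # Additional low-rank task subspace; `up` starts at zero so the
        # adapted layer matches the frozen base at initialization.
        self.down = nn.Parameter(torch.randn(rank, in_f) * 0.01)
        self.up = nn.Parameter(torch.zeros(out_f, rank))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        y = self.base(x) * self.scale + self.shift
        return y + x @ self.down.t() @ self.up.t()
```

Only `scale`, `shift`, `down`, and `up` are trainable, so the per-task overhead is a small fraction of the frozen layer's parameters, in the spirit of the 0.5-0.9% figure reported for the full model.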
Experiments show that transformer-based diffusion models adapt better to smaller datasets compared to CNN-based models when performing parameter-efficient fine-tuning. DiffScaler enables a single transformer-based diffusion model to generate high-quality images across multiple unconditional datasets (e.g., FFHQ, Flowers, CUB-200, Caltech-101) and conditional tasks (e.g., depth maps, segmentation maps, canny edges) with just 0.5-0.9% of the total model parameters.
DiffScaler outperforms existing parameter-efficient fine-tuning methods such as DiffFit and LoRA, while achieving performance comparable to full fine-tuning, demonstrating its effectiveness in scaling diffusion models to diverse tasks.
The authors also show that DiffScaler can be used to enable a single text-conditioned diffusion model to perform multiple spatial conditioning tasks simultaneously, without the need for separate encoders or zero-initialized convolutional layers as in ControlNet.
Key insights extracted from the paper by Nithin Gopal... at arxiv.org, 04-16-2024: https://arxiv.org/pdf/2404.09976.pdf