Core Concepts
Cross-attention in text-to-image diffusion models can be selectively cached and reused to significantly improve inference efficiency without compromising generation quality.
Abstract
The study explores the role of cross-attention during the inference process of text-conditional diffusion models. The key findings are:
- Cross-attention outputs converge to a fixed point after the first few inference steps. This observation divides the entire inference process into two stages:
- Semantics-planning stage: The model relies on cross-attention to plan text-oriented visual semantics.
- Fidelity-improving stage: The model generates images from the previously planned semantics.
- Cross-attention is redundant in the fidelity-improving stage. Bypassing cross-attention in this stage reduces computational cost without compromising generation quality.
Based on these findings, the authors propose a simple and training-free method called TGATE (Temporally Gating Cross-Attention) to improve the efficiency of text-to-image diffusion models:
- TGATE caches the cross-attention outputs once they converge and reuses them in the fidelity-improving stage, eliminating redundant cross-attention computations.
- TGATE can reduce the number of Multiply-Accumulate Operations (MACs) by up to 65T per image and eliminate 0.5B parameters in the fidelity-improving stage, resulting in around 50% latency reduction compared to the baseline model.
- TGATE maintains or even slightly improves the generation quality, as measured by the Fréchet Inception Distance (FID) on the MS-COCO validation set.
The authors further demonstrate the effectiveness and broad applicability of TGATE by integrating it with various base models, noise schedulers, and acceleration techniques, consistently achieving improved efficiency without compromising performance.
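The gating idea above can be sketched in a few lines. The following is a minimal, illustrative NumPy sketch, not the authors' implementation: the class name, the `gate_step` parameter, and the caching policy are assumptions for exposition (the actual method operates inside a diffusion U-Net and caches outputs under classifier-free guidance).

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

class GatedCrossAttention:
    """Illustrative sketch of TGATE-style temporal gating (names and
    structure are hypothetical, not the paper's code). Before `gate_step`
    denoising steps, cross-attention runs normally (semantics-planning
    stage); afterwards, the cached output is reused and the attention
    computation is skipped (fidelity-improving stage)."""

    def __init__(self, gate_step=10):
        self.gate_step = gate_step  # step at which outputs are treated as converged
        self.cache = None

    def __call__(self, step, query, text_keys, text_values):
        if step >= self.gate_step and self.cache is not None:
            # Fidelity-improving stage: reuse the cached output.
            return self.cache
        # Semantics-planning stage: compute scaled dot-product cross-attention.
        d = query.shape[-1]
        attn = softmax(query @ text_keys.T / np.sqrt(d))
        out = attn @ text_values
        if step == self.gate_step - 1:
            self.cache = out  # cache the (assumed converged) output once
        return out
```

In a denoising loop, each step would call this layer with the current query; once the gate closes, the text keys and values are no longer needed, which is what allows the text-encoder-side parameters and MACs to be dropped in the second stage.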
Stats
Cross-attention maps converge within the first 5-10 inference steps.
TGATE can reduce 65T MACs per image and eliminate 0.5B parameters in the fidelity-improving stage.
TGATE can reduce the latency by around 50% compared to the baseline model (SDXL).
Quotes
"Cross-attention is redundant in the fidelity-improving stage. During the semantics-planning stage, cross-attention plays a crucial role in creating meaningful semantics. Yet, in the latter stage, cross-attention converges and has a minor impact on the generation process."
"Bypassing cross-attention during the fidelity-improving stage can potentially reduce the computation cost while maintaining the image generation quality."