Core Concepts
Diffusion Transformers (DiTs) suffer from high inference latency because the computational cost of self-attention grows quadratically with sequence length. xDiT addresses this with a hybrid parallel inference engine that combines Sequence Parallelism, a novel Patch-level Pipeline Parallelism (PipeFusion), and CFG parallelism, enabling efficient DiT deployment across diverse hardware interconnects and model architectures.
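As a rough illustration of how these parallel dimensions compose, the sketch below shows one way to map a flat GPU rank onto a CFG group, a pipeline stage, and a sequence-parallel rank. The function name and the ordering of dimensions are illustrative assumptions, not xDiT's actual process-group API; the key invariant is that the three degrees must multiply to the total GPU count.

```python
# Hypothetical sketch of hybrid-parallel rank decomposition; xDiT's real
# process-group setup may name and order these dimensions differently.
def decompose_rank(rank: int, cfg_degree: int, pp_degree: int, sp_degree: int):
    """Map a flat GPU rank to (CFG branch, pipeline stage, sequence-parallel rank)."""
    sp_rank = rank % sp_degree                  # innermost: sequence parallelism
    pp_rank = (rank // sp_degree) % pp_degree   # middle: pipeline (PipeFusion) stage
    cfg_rank = rank // (sp_degree * pp_degree)  # outermost: CFG branch
    return cfg_rank, pp_rank, sp_rank

world_size, cfg_degree, pp_degree, sp_degree = 8, 2, 2, 2
assert cfg_degree * pp_degree * sp_degree == world_size  # degrees must factor the GPU count
for rank in range(world_size):
    print(rank, decompose_rank(rank, cfg_degree, pp_degree, sp_degree))
```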
Stats
The sequence length of the input to transformers in high-quality image and video generation tasks can exceed 1 million tokens.
The leading open-source image generation model, Flux.1, requires a sequence length of 262K tokens to generate a 1024px (1024×1024) image; at 4096px, the input sequence grows to 4.2 million tokens.
The leading open-source video generation model, CogVideoX, generates a 6-second video at 480×720 resolution from a sequence of 17K tokens; generating a one-minute 4K (3840×2160) video would push the sequence length past 4 million tokens.
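These figures follow from token counts scaling quadratically with image side length, and linearly with both video duration and per-frame pixel count; the quick check below reproduces the larger numbers from the base figures quoted above.

```python
# Sanity check of the quoted sequence lengths, using the base figures above.
# Image tokens scale with pixel count, i.e. quadratically in side length.
base_px, base_tokens = 1024, 262_000              # Flux.1 at 1024x1024
print(base_tokens * (4096 / base_px) ** 2 / 1e6)  # -> ~4.2M tokens at 4096px

# Video tokens scale with duration and per-frame pixel count.
base_video_tokens = 17_000                        # CogVideoX, 6 s at 480x720
duration_scale = 60 / 6                           # 6 s -> one minute
area_scale = (3840 * 2160) / (480 * 720)          # 480x720 -> 4K frames
print(base_video_tokens * duration_scale * area_scale / 1e6)  # -> ~4.1M tokens
```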
For the 4096px image generation task, xDiT achieved a speedup of 13.29× on 16 GPUs compared to a single GPU, reducing the latency from 245 seconds to 17 seconds.
In the 2048px task on 8×A100 GPUs, PipeFusion exhibited poor scalability due to the skip-connection structure of the DiT model: long skip connections link shallow and deep blocks that live on different pipeline stages, forcing extra inter-stage communication.
For the 4096px task, DistriFusion encountered out-of-memory (OOM) errors because the memory cost of its KV buffer grows linearly with sequence length.
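To see why this buffer becomes prohibitive at such sequence lengths, the estimate below caches full-sequence K and V activations for every transformer layer, as a DistriFusion-style stale-KV scheme does. The layer count, hidden size, and fp16 dtype are illustrative assumptions, not the benchmark's actual configuration.

```python
# Rough estimate of a DistriFusion-style stale-KV buffer. Model dimensions
# here are assumptions for illustration, not the measured configuration.
def kv_buffer_gib(seq_len, num_layers=28, hidden=1152, dtype_bytes=2):
    # K and V are both cached for every layer over the full sequence.
    return 2 * num_layers * seq_len * hidden * dtype_bytes / 2**30

for px, tokens in ((1024, 262_000), (2048, 1_048_000), (4096, 4_192_000)):
    print(f"{px}px (~{tokens / 1e6:.1f}M tokens): ~{kv_buffer_gib(tokens):.0f} GiB")
```

Even with these modest assumed dimensions, the buffer grows from tens of GiB at 1024px to hundreds of GiB at 4096px, well past a single GPU's memory.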
Using patch parallelism, xDiT's VAE can decode images at resolutions up to 7168px on 8×L40 GPUs, a pixel count more than 12.25 times larger than the naive VAE approach can handle.
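The 12.25× figure is consistent with a pixel-count ratio, since (7168/2048)² = 12.25, which would put the naive VAE's ceiling around 2048px. Below is a minimal sketch of the patch-parallel idea: the latent is split into bands that are decoded independently and then stitched back together. It deliberately omits the convolutional halo/overlap exchange between neighboring patches that a correct implementation needs, and the helper names are hypothetical, not xDiT's API.

```python
import torch

def patch_parallel_decode(latent: torch.Tensor, decode_fn, num_devices: int = 8):
    """Illustrative patch-parallel VAE decode: split, decode, reassemble.

    latent: (C, H, W) latent tensor; decode_fn: a single-patch VAE decoder.
    NOTE: real implementations must also exchange conv halo regions between
    neighboring bands; this sketch skips that for brevity.
    """
    bands = torch.chunk(latent, num_devices, dim=1)  # one horizontal band per device
    # In the real engine each band is decoded on its own GPU, so peak memory
    # per device shrinks roughly by num_devices; here we loop sequentially
    # to keep the sketch self-contained.
    decoded = [decode_fn(band) for band in bands]
    return torch.cat(decoded, dim=1)                 # stitch bands back together
```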