Core Concept
A novel approach that uses self-supervised discrimination to improve the training efficiency and generative capacity of the Diffusion Transformer.
Summary
The article introduces SD-DiT, a novel approach that leverages self-supervised discrimination to improve the training efficiency and generative capacity of Diffusion Transformer models. It discusses the limitations of existing mask strategies and proposes a decoupled encoder-decoder structure to address them. By framing mask modeling in a teacher-student manner, SD-DiT aims to strengthen both the discriminative and the generative objective effectively and efficiently. Extensive experiments on the ImageNet dataset demonstrate the competitive performance of SD-DiT compared to state-of-the-art DiT models.
1. Introduction
- Diffusion models have gained popularity in computer vision for image generation.
- The emergence of Diffusion Transformer (DiT) has improved scalability and capacity.
- Training DiT models faces challenges such as slow convergence and a heavy computational burden.
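Diffusion models of the kind DiT builds on are trained to undo a fixed forward noising process. As a minimal sketch (the schedule values and array shapes are illustrative, using NumPy rather than the paper's actual training framework):

```python
import numpy as np

def forward_diffuse(x0, t, alpha_bar, rng):
    """Sample x_t ~ q(x_t | x_0) for a DDPM-style forward process.

    alpha_bar[t] is the cumulative product of (1 - beta_s) up to step t;
    the model is later trained to predict the noise eps from x_t.
    """
    eps = rng.standard_normal(x0.shape)  # Gaussian noise to be predicted
    xt = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return xt, eps

# Toy linear beta schedule over T steps
T = 1000
betas = np.linspace(1e-4, 0.02, T)
alpha_bar = np.cumprod(1.0 - betas)

rng = np.random.default_rng(0)
x0 = rng.standard_normal((4, 4))  # stand-in for an image (or latent) patch
xt, eps = forward_diffuse(x0, t=T - 1, alpha_bar=alpha_bar, rng=rng)
# At the final step, alpha_bar[-1] is tiny, so x_t is almost pure noise.
```

The slow convergence noted above comes from optimizing the denoising objective across all of these timesteps.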
2. Related Work
- Denoising diffusion probabilistic models have advanced generative tasks such as image synthesis.
- Transformers provide scaling properties for diffusion models.
- Self-supervised learning with diffusion models integrates mask strategies for improved performance.
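The mask strategies referenced here typically drop a random subset of patch tokens before the encoder and ask a decoder to reconstruct the rest. A rough NumPy sketch of such token masking (the function name, ratio, and shapes are illustrative, not from the paper):

```python
import numpy as np

def random_mask_tokens(tokens, mask_ratio, rng):
    """Drop a random subset of patch tokens, as in mask-based DiT training.

    tokens: (N, D) array of patch embeddings. Returns the visible tokens
    and a boolean keep-mask so a decoder could reconstruct the masked rest.
    """
    n = tokens.shape[0]
    n_keep = int(n * (1.0 - mask_ratio))
    keep_idx = rng.choice(n, size=n_keep, replace=False)
    keep = np.zeros(n, dtype=bool)
    keep[keep_idx] = True
    return tokens[keep], keep

rng = np.random.default_rng(0)
tokens = rng.standard_normal((256, 64))  # 256 patches, 64-dim embeddings
visible, keep = random_mask_tokens(tokens, mask_ratio=0.5, rng=rng)
```

Only the visible tokens pass through the encoder, which is what makes masked training cheaper per step.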
3. Approach
- SD-DiT introduces self-supervised discrimination via knowledge distillation in a teacher-student scheme.
- A decoupled encoder-decoder structure is used to jointly train the discriminative and generative objectives.
- The generative loss is optimized without mask reconstruction, improving training efficiency.
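The paper's architecture is not reproduced here, but the teacher-student discrimination idea can be sketched in miniature: the teacher's weights track the student's via an exponential moving average, and a discriminative loss pulls student features toward teacher features. A hedged NumPy illustration (the momentum value, feature shapes, and cosine-alignment loss are illustrative assumptions, not SD-DiT's exact formulation):

```python
import numpy as np

def ema_update(teacher_w, student_w, momentum=0.996):
    """Teacher weights as an exponential moving average of the student's.

    The teacher receives no gradients; it only tracks the student.
    """
    return momentum * teacher_w + (1.0 - momentum) * student_w

def discriminative_loss(student_feat, teacher_feat):
    """Align student and teacher token features via negative cosine similarity."""
    s = student_feat / np.linalg.norm(student_feat, axis=-1, keepdims=True)
    t = teacher_feat / np.linalg.norm(teacher_feat, axis=-1, keepdims=True)
    return float(np.mean(1.0 - np.sum(s * t, axis=-1)))

rng = np.random.default_rng(0)
w_student = rng.standard_normal(16)
w_teacher = w_student.copy()                 # teacher starts as a copy
w_teacher = ema_update(w_teacher, w_student) # one EMA step after a student update

feat = rng.standard_normal((8, 32))          # 8 tokens, 32-dim features
loss_same = discriminative_loss(feat, feat)  # identical views give zero loss
```

In SD-DiT this discriminative objective is trained jointly with the standard generative (denoising) loss, rather than replacing it.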
Statistics
Extensive experiments are conducted on the ImageNet dataset.
Our method achieves a competitive balance between training cost and generative capacity.