Core Concepts
DARL unifies visual perception and generation through autoregressive and denoising diffusion models.
Summary
1. Abstract:
- DARL uses a decoder-only Transformer for autoregressive image patch prediction.
- Training with Mean Squared Error (MSE) leads to strong representations.
- Diffusion objective with denoising patch decoder enhances image generation ability.
- Tailored noise schedules and longer training improve learned representations.
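The next-patch idea above can be sketched in a few lines. This is a minimal NumPy illustration of autoregressive patch prediction with an MSE loss, not DARL's actual pipeline: the `patchify` helper and the random stand-in "predictions" are assumptions for demonstration only.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an image of shape (H, W, C) into a sequence of flattened patches."""
    H, W, C = image.shape
    p = patch_size
    patches = image.reshape(H // p, p, W // p, p, C)
    patches = patches.transpose(0, 2, 1, 3, 4).reshape(-1, p * p * C)
    return patches

def next_patch_mse(patches, predictions):
    """MSE between predicted and actual next patches.

    predictions[t] is the model's guess for patches[t + 1], so the
    prediction after the final patch has no target and is dropped.
    """
    targets = patches[1:]
    preds = predictions[:-1]
    return float(np.mean((preds - targets) ** 2))

# Toy usage: random data stands in for an image and for model output.
rng = np.random.default_rng(0)
img = rng.standard_normal((8, 8, 3))
patches = patchify(img, 4)                  # 4 patches, each of dim 48
preds = rng.standard_normal(patches.shape)  # stand-in for Transformer output
loss = next_patch_mse(patches, preds)
```

A prediction sequence that exactly matches the shifted patch sequence drives this loss to zero, which is the sense in which the model learns to predict the next patch.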
2. Introduction:
- Large Language Models (LLMs) popular for generative pre-training.
- Vision representation learning and image generation often use separate techniques.
- DARL combines autoregressive and denoising diffusion models into a unified model.
3. Denoising Autoregressive Representation Learning (DARL):
- Architecture: Vision Transformer backbone with causal attention masking.
- Rotary Positional Embedding for improved performance.
- Diffusion objective for multi-modal belief over patch content.
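The causal attention masking mentioned above can be sketched as follows. This is a generic NumPy illustration of a lower-triangular mask applied before softmax, as in any decoder-only Transformer; it is not DARL's implementation.

```python
import numpy as np

def causal_mask(num_patches):
    """Lower-triangular mask: patch t may attend to patches 0..t only."""
    return np.tril(np.ones((num_patches, num_patches), dtype=bool))

def masked_attention_weights(scores, mask):
    """Apply the causal mask before softmax; disallowed positions get -inf."""
    masked = np.where(mask, scores, -np.inf)
    masked = masked - masked.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(masked)  # exp(-inf) == 0, so masked positions vanish
    return weights / weights.sum(axis=-1, keepdims=True)

# Toy usage: uniform scores over 4 patches.
mask = causal_mask(4)
weights = masked_attention_weights(np.zeros((4, 4)), mask)
```

With uniform scores, patch 0 attends only to itself while the last patch spreads attention evenly over all four positions, which is exactly the prefix-only information flow that autoregressive prediction requires.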
4. Experiments:
- Fixed patch ordering strategies compared to raster order.
- Random ordering with an XLNet-style architecture requires longer training.
- DARL achieves comparable performance to state-of-the-art models.
5. Impact Statement:
- Ethical concerns about image generation and dataset bias.
- DARL advances generative pre-training in the vision domain.
Stats
Training with an MSE loss yields strong representations.
A diffusion objective is used to improve model performance.
Model performance depends on the noise schedule used during training.
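A noise schedule maps a timestep to the signal/noise mix applied to each patch. As a concrete example, here is the cosine schedule of Nichol & Dhariwal (2021) in plain Python; DARL tailors its own schedules, so treat this as a generic illustration rather than the paper's exact choice.

```python
import math

def cosine_alpha_bar(t, T, s=0.008):
    """Cumulative signal level alpha_bar(t) for a cosine noise schedule.

    alpha_bar falls from 1 (clean patch) at t=0 toward 0 (pure noise) at t=T.
    """
    f = math.cos((t / T + s) / (1 + s) * math.pi / 2) ** 2
    f0 = math.cos(s / (1 + s) * math.pi / 2) ** 2
    return f / f0

def noise_scales(t, T):
    """Scales for the forward process x_t = sqrt(ab)*x_0 + sqrt(1-ab)*eps."""
    ab = cosine_alpha_bar(t, T)
    return math.sqrt(ab), math.sqrt(1.0 - ab)
```

Changing this mapping shifts how much training effort goes to high-noise versus low-noise steps, which is the lever the notes above refer to when they say the schedule shapes the learned representation.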
Quotes
"Despite its simple architecture, DARL delivers performance remarkably close to state-of-the-art masked prediction models."
"Our method significantly advances representation learning with generative pre-training."