Core Concepts
DARL (Denoising Autoregressive Representation Learning) unifies visual perception and generation by combining autoregressive and denoising diffusion models, achieving performance comparable to state-of-the-art masked prediction models.
Abstract
DARL introduces a new generative approach for learning visual representations. It employs a decoder-only Transformer to predict image patches autoregressively. By replacing the Mean Squared Error (MSE) loss with a diffusion objective, implemented through a denoising patch decoder, DARL improves image generation ability. The learned representation benefits from tailored noise schedules and from longer training of larger models. Despite its simple architecture, DARL delivers performance close to state-of-the-art masked prediction models under the fine-tuning protocol.
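The abstract's core idea can be illustrated with a minimal PyTorch sketch: a causal Transformer encodes the preceding patches, and a small denoising decoder, conditioned on that context, predicts the noise added to the next patch. All names, layer sizes, and the simple linear noise schedule below are illustrative assumptions, not the paper's actual configuration.

```python
import torch
import torch.nn as nn

class DARLSketch(nn.Module):
    """Hypothetical sketch of a DARL-style model: a causal Transformer
    over patch embeddings plus a denoising patch decoder trained with a
    noise-prediction (diffusion) objective."""
    def __init__(self, patch_dim=48, d_model=64, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Linear(patch_dim, d_model)
        layer = nn.TransformerEncoderLayer(
            d_model, n_heads, dim_feedforward=128, batch_first=True)
        self.backbone = nn.TransformerEncoder(layer, n_layers)
        # Denoising patch decoder: conditions on the context vector,
        # the noisy target patch, and the diffusion timestep.
        self.decoder = nn.Sequential(
            nn.Linear(d_model + patch_dim + 1, 128),
            nn.GELU(),
            nn.Linear(128, patch_dim),
        )

    def forward(self, patches):
        # patches: (B, N, patch_dim); predict patch i+1 from patches <= i.
        B, N, D = patches.shape
        causal = torch.triu(torch.full((N - 1, N - 1), float("-inf")), 1)
        ctx = self.backbone(self.embed(patches[:, :-1]), mask=causal)
        target = patches[:, 1:]                   # next-patch targets
        t = torch.rand(B, N - 1, 1)               # diffusion "time" in [0, 1]
        noise = torch.randn_like(target)
        # Toy linear forward process (an assumption, not the paper's
        # schedule): x_t = (1 - t) * x_0 + t * eps.
        noisy = (1 - t) * target + t * noise
        pred = self.decoder(torch.cat([ctx, noisy, t], dim=-1))
        return ((pred - noise) ** 2).mean()       # noise-prediction loss

model = DARLSketch()
loss = model(torch.randn(2, 17, 48))  # 2 images, 17 patches of dim 48
```

Replacing the denoising decoder and noise-prediction loss with a direct MSE regression of the next patch recovers the plain autoregressive baseline the abstract compares against.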
Stats
Training for 800 epochs yields superior results for the diffusion objective.
A ViT-L16 model pre-trained with the MSE objective achieves 82.7% top-1 accuracy.
DARL trained with the diffusion objective shows only a minor performance gap of 1% compared to state-of-the-art masked prediction models.
Quotes
"DARL delivers performance remarkably close to state-of-the-art masked prediction models under the fine-tuning protocol."
"The optimal noise schedule differs significantly from those suitable for generation purpose."