核心概念
ADDP proposes an Alternating Denoising Diffusion Process that bridges pixel and token spaces, enabling the learning of general representations applicable to both image recognition and generation tasks.
摘要
The paper introduces ADDP, a general representation learning framework that can be applied to both image recognition and generation tasks. The key insights are:
- Pixels as inputs are crucial for recognition tasks, as they preserve spatially sensitive information better than VQ tokens.
- VQ tokens as reconstruction targets are beneficial for generation tasks, as they can help the model eliminate imperceptible image details and improve generation quality.
To leverage the advantages of both pixel and token spaces, ADDP proposes an Alternating Denoising Diffusion Process. At each step, the model first decodes pixels from previous VQ tokens, then generates new VQ tokens from the decoded pixels. The diffusion process gradually masks out a portion of VQ tokens to construct the training samples.
The learned representations from ADDP can be used for both image generation and various recognition tasks, including classification, detection, and segmentation. Extensive experiments show that ADDP achieves competitive performance on unconditional generation, ImageNet classification, COCO detection, and ADE20k segmentation, outperforming previous methods designed for either recognition or generation alone.
统计
The ImageNet-1k dataset is used for pre-training.
FID and IS scores are reported on the ImageNet-1k 256x256 validation set for unconditional generation.
Top-1 accuracy is reported for ImageNet-1k classification after fine-tuning.
AP for object detection on COCO test-dev set and mIoU for semantic segmentation on ADE20k validation set are reported.
引用
"Pixels as inputs are crucial for recognition tasks. Pixels preserve spatially sensitive information better than VQ tokens, which is particularly useful for dense recognition tasks."
"VQ tokens as reconstruction targets are beneficial for generation tasks. Previous works such as (van den Oord et al., 2017; Rombach et al., 2022) show that compared to generating raw pixels, predicting VQ tokens can help the model eliminate imperceptible image details, mitigating the optimization difficulty and resulting in better image generation quality."