Yu, Q., He, J., Deng, X., Shen, X., & Chen, L. (2024). Randomized Autoregressive Visual Generation. arXiv preprint arXiv:2411.00776.
This paper introduces Randomized Autoregressive Modeling (RAR) to address the limitations of unidirectional context modeling in autoregressive image generation while preserving compatibility with language modeling frameworks.
RAR trains with a randomness annealing strategy: the input image token sequence is randomly permuted with a probability that decays linearly from 1 to 0 over the course of training, so the model learns bidirectional context early on while converging to a fixed raster scan order by the end. Target-aware positional embeddings are incorporated so the model knows which position it must predict next, avoiding the ambiguity that permuted orders would otherwise introduce. The authors evaluate RAR on the ImageNet-256 benchmark, comparing FID scores and sampling speed against other state-of-the-art image generation models.
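To make the training strategy concrete, here is a minimal PyTorch sketch of the two mechanisms described above: the linearly annealed permutation probability and target-aware positional embeddings. All names (`permute_prob`, `prepare_batch`, the annealing-window arguments) are illustrative assumptions, not the authors' code; the paper treats the annealing window as a hyperparameter, which the defaults here only approximate.

```python
import torch

def permute_prob(step: int, total_steps: int,
                 anneal_start: float = 0.0, anneal_end: float = 1.0) -> float:
    """Probability of permuting the token order, decayed linearly from 1 to 0.

    anneal_start / anneal_end give the fraction of training over which the
    decay runs (illustrative defaults; the paper exposes this as a knob).
    """
    t = step / total_steps
    if t <= anneal_start:
        return 1.0
    if t >= anneal_end:
        return 0.0
    return 1.0 - (t - anneal_start) / (anneal_end - anneal_start)

def prepare_batch(tokens: torch.Tensor, pos_emb: torch.Tensor,
                  target_pos_emb: torch.Tensor, step: int, total_steps: int):
    """Build one RAR-style training example.

    tokens:         (B, N) discrete image-token ids from a tokenizer
    pos_emb:        (N, D) embeddings for the input positions
    target_pos_emb: (N, D) embeddings announcing which position comes next,
                    so the model knows what it must predict under a
                    permuted order (the target-aware component)
    """
    B, N = tokens.shape
    if torch.rand(()).item() < permute_prob(step, total_steps):
        order = torch.randperm(N)   # random scan order early in training
    else:
        order = torch.arange(N)     # fixed raster order late in training
    x = tokens[:, order]
    inputs, targets = x[:, :-1], x[:, 1:]
    # Position i sees its own location plus the location of the token it
    # must predict (order[i + 1]); both are added to the token embeddings
    # before the causal transformer.
    cond = pos_emb[order[:-1]] + target_pos_emb[order[1:]]
    return inputs, targets, cond
```

Because the probability reaches 0 by the end of training, inference is an ordinary left-to-right decode in raster order, which is what preserves compatibility with standard language-model sampling.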
RAR presents a simple yet effective approach to enhance autoregressive image generation by incorporating bidirectional context learning without deviating from the standard autoregressive framework. This approach achieves state-of-the-art results, demonstrating the potential of RAR for advancing unified frameworks for visual understanding and generation.
This research significantly contributes to the field of autoregressive visual modeling by introducing a novel training strategy that addresses the limitations of unidirectional context modeling. The compatibility of RAR with language modeling frameworks opens up possibilities for leveraging advancements in LLMs for improved visual generation.
While RAR effectively incorporates bidirectional context learning, the generation process still faces limitations in capturing full global context due to the sequential nature of token generation. Future research could explore techniques like resampling or refinement to address this limitation. Additionally, investigating the applicability of RAR to other visual modalities and downstream tasks could further expand its impact.