Key Idea
Visual Autoregressive (VAR) modeling redefines autoregressive learning on images as a coarse-to-fine "next-scale prediction" strategy, which lets autoregressive transformers learn visual distributions quickly, generalize well, and surpass diffusion models in image synthesis.
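The coarse-to-fine idea can be sketched as a loop that predicts an entire token map at each successively larger resolution, conditioned on all coarser maps. This is a minimal illustrative sketch, not the paper's implementation: the `predict_scale` stand-in, the scale schedule, and the vocabulary size are all hypothetical placeholders for the real VAR transformer and VQ tokenizer.

```python
import numpy as np

def predict_scale(prev_scales, size, rng, vocab=4096):
    """Hypothetical stand-in for the VAR transformer: predict a
    (size x size) token map conditioned on all coarser scales.
    A real model would attend over the flattened coarser maps;
    here we just sample random token ids for illustration."""
    return rng.integers(0, vocab, size=(size, size))

def var_generate(scales=(1, 2, 4, 8, 16), seed=0):
    """Coarse-to-fine 'next-scale prediction': each autoregressive
    step emits a whole token map at the next resolution, rather than
    a single token, preserving 2D spatial locality within each scale."""
    rng = np.random.default_rng(seed)
    generated = []
    for s in scales:
        generated.append(predict_scale(generated, s, rng))
    # The finest map would be decoded to pixels by a VQ decoder.
    return generated

maps = var_generate()
print([m.shape for m in maps])  # [(1, 1), (2, 2), (4, 4), (8, 8), (16, 16)]
```

Because each step emits a full map in parallel, the number of autoregressive steps grows with the number of scales rather than the number of tokens, which is the source of the efficiency gain described above.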
Summary
The content presents a new visual generative framework called Visual Autoregressive (VAR) modeling, which addresses the limitations of standard autoregressive (AR) models for image generation.
Key highlights:
- VAR reformulates autoregressive modeling on images as "next-scale prediction" instead of the conventional "next-token prediction". This allows VAR to preserve 2D spatial locality, satisfy the mathematical premise of autoregressive modeling (each step conditions only on previously generated content), and improve computational efficiency.
- VAR models exhibit clear power-law scaling laws similar to those observed in large language models (LLMs), with linear correlation coefficients near -0.998, indicating robust scalability.
- VAR surpasses strong diffusion models like Diffusion Transformer (DiT) in image quality, data efficiency, inference speed, and scalability.
- VAR showcases zero-shot generalization ability in downstream tasks like image in-painting, out-painting, and editing.
- The authors have released all models and codes to promote the exploration of AR/VAR models for visual generation and unified learning.
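The power-law scaling claim above (log-log linear fits with correlation near -0.998) can be illustrated with a toy fit. The compute/loss values below are fabricated purely for illustration; only the fitting procedure reflects how such scaling laws are typically measured.

```python
import numpy as np

# Synthetic points following an exact power law L = a * C^b
# (made-up numbers; the paper's actual data is not reproduced here).
compute = np.array([1e18, 1e19, 1e20, 1e21, 1e22])
loss = 5.0 * compute ** -0.08

# A power law is a straight line in log-log space, so fit linearly there.
logC, logL = np.log(compute), np.log(loss)
slope, intercept = np.polyfit(logC, logL, 1)
r = np.corrcoef(logC, logL)[0, 1]

print(round(slope, 3))  # -0.08 (the recovered power-law exponent)
print(round(r, 3))      # -1.0 for this noiseless synthetic data
```

On real training runs the points are noisy, so a correlation magnitude near 0.998 (as reported for VAR) indicates an unusually clean power-law trend.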
Statistics
Aside from the scaling-law correlation coefficient (near -0.998), the content provides no specific numerical metrics to support its key claims. It focuses on describing the VAR framework and comparing its performance qualitatively with other generative models.
Quotes
The content does not contain any striking quotes that support the key claims.