Marioriyad, A., Rezaei, P., Baghshah, M.S., & Rohban, M.H. (2024). Diffusion Beats Autoregressive: An Evaluation of Compositional Generation in Text-to-Image Models. arXiv preprint arXiv:2410.22775v1.
This research paper aims to evaluate and compare the compositional generation capabilities of diffusion-based and autoregressive text-to-image (T2I) models. The authors investigate whether the next-token prediction paradigm employed in autoregressive models is sufficient for complex image generation from textual descriptions.
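The distinction between the two paradigms can be made concrete with a toy sketch. The snippet below is not from the paper; the random "models" are placeholders so the loops run end-to-end, and it is only meant to contrast autoregressive next-token sampling over discrete image tokens with iterative denoising of a full latent.

```python
# Illustrative sketch only: placeholder "models" stand in for large trained networks.
import numpy as np

rng = np.random.default_rng(0)

def autoregressive_generate(prompt_tokens, seq_len=16, vocab_size=1024):
    """Next-token prediction: image tokens are sampled one at a time,
    each step conditioned (in a real model) on the prompt and all prior tokens."""
    image_tokens = []
    for _ in range(seq_len):
        logits = rng.normal(size=vocab_size)          # placeholder for p(next | prompt, so_far)
        probs = np.exp(logits) / np.exp(logits).sum()
        image_tokens.append(rng.choice(vocab_size, p=probs))
    return image_tokens  # a real system decodes these tokens to pixels with a tokenizer's decoder

def diffusion_generate(prompt_embedding, shape=(8, 8), steps=10):
    """Denoising diffusion: the whole image latent is refined in parallel,
    starting from pure noise and iterating a fixed number of steps."""
    x = rng.normal(size=shape)
    for _ in range(steps):
        predicted_noise = rng.normal(size=shape)      # placeholder denoiser output
        x = x - (1.0 / steps) * predicted_noise       # crude update toward the data manifold
    return x

print(autoregressive_generate(prompt_tokens=[1, 2, 3])[:5])
print(diffusion_generate(prompt_embedding=None).shape)
```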
The study evaluates nine state-of-the-art T2I models, including Stable Diffusion variants, DALL-E variants, Pixart-α, FLUX variants, and LlamaGen variants. The authors use the T2I-CompBench benchmark to assess the models across four aspects of compositional generation: attribute binding, object relationships, numeracy, and complex compositions. The evaluation employs several metrics: BLIP-VQA for attribute binding, UniDet for spatial relationships and numeracy, the CLIP similarity score, GPT-based multi-modal evaluation, chain-of-thought prompting with ShareGPT-4v, and a 3-in-1 metric that combines the CLIP, BLIP-VQA, and UniDet scores.
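A minimal sketch of how such a combined 3-in-1 score could be assembled, assuming a simple average of the three sub-scores, is shown below; the helper values are hypothetical placeholders, not the benchmark's actual implementation.

```python
# Hedged sketch of a combined 3-in-1 score, assumed here to be the mean of the
# three sub-metrics; the example sub-score values below are made up.
def three_in_one_score(clip_score: float, blip_vqa_score: float, unidet_score: float) -> float:
    """Combine the three sub-metrics into a single score in [0, 1]."""
    return (clip_score + blip_vqa_score + unidet_score) / 3.0

# Example: one generated image scored by the three (placeholder) sub-metrics.
print(three_in_one_score(clip_score=0.31, blip_vqa_score=0.72, unidet_score=0.55))
```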
The results demonstrate that diffusion-based models consistently outperform autoregressive models across all compositional generation tasks. Notably, LlamaGen, a vanilla autoregressive model, underperforms even Stable Diffusion v1.4, a diffusion model with comparable model size and inference time. This suggests that relying solely on next-token prediction, without additional inductive biases, may be insufficient to match diffusion models in compositional generation. By contrast, the open-source diffusion-based model FLUX is competitive with the state-of-the-art closed-source model DALL-E 3.
The study concludes that the pure next-token prediction paradigm might not be adequate for generating images that fully align with complex textual prompts. The authors suggest that incorporating inductive biases tailored to visual generation, exploring alternative image tokenization methods, and further investigating the limitations of autoregressive models in capturing complex conditions are crucial areas for future research.
This research provides valuable insights into the strengths and limitations of different generative approaches for T2I synthesis, particularly concerning compositional generation. The findings highlight the importance of inductive biases in visual generation and encourage further exploration of alternative architectures and training strategies for autoregressive models to improve their compositional generation capabilities.
The study primarily focuses on evaluating existing models using a specific benchmark. Future research could explore novel architectures and training methodologies for autoregressive models, investigate the impact of different image tokenizers, and develop new benchmarks to assess compositional generation capabilities comprehensively.