Core Concepts
Show-o is a unified transformer model that can handle both multimodal understanding and generation tasks using a single network, unifying autoregressive and discrete diffusion modeling.
Abstract
Key highlights and insights:
Show-o is a novel unified transformer model that can handle both multimodal understanding and generation tasks using a single network. It unifies autoregressive and discrete diffusion modeling within one transformer architecture.
Unlike existing approaches that treat understanding and generation as separate tasks, Show-o can perform both through a unified prompting strategy that formats various input data into a structured sequence. It employs an "omni-attention" mechanism that adaptively applies causal attention for text tokens and full attention for image tokens.
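The omni-attention idea can be sketched as a mask-building routine: start from a standard causal mask, then open full attention between image tokens. This is an illustrative sketch of the mechanism described above, not the paper's implementation; the function name and the 0/1 token-type encoding are assumptions.

```python
def omni_attention_mask(token_types):
    """Build a boolean attention mask for a mixed text/image sequence.

    token_types: list where 0 marks a text token and 1 marks an image token
    (an assumed encoding for this sketch). mask[i][j] is True when query
    token i may attend to key token j.
    """
    n = len(token_types)
    # Standard causal mask: every token sees itself and earlier tokens.
    mask = [[j <= i for j in range(n)] for i in range(n)]
    # Image tokens additionally attend to all other image tokens,
    # including those that appear later in the sequence (full attention).
    for i in range(n):
        for j in range(n):
            if token_types[i] == 1 and token_types[j] == 1:
                mask[i][j] = True
    return mask
```

For a sequence like `[text, text, image, image, text]`, text tokens remain strictly causal while the two image tokens can attend to each other in both directions, which is what lets the diffusion branch denoise image tokens jointly.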
Across various benchmarks, Show-o matches or outperforms individual models tailored to either understanding or generation, despite having an equivalent or smaller model size. This highlights its potential as a next-generation foundation model.
Show-o supports a wide range of vision-language tasks including visual question-answering, text-to-image generation, text-guided inpainting/extrapolation, and mixed-modality generation, without requiring any fine-tuning.
The authors explore the impact of different image representations (discrete or continuous) on multimodal understanding performance, providing insights for improving the design of unified models.
Show-o's training pipeline involves three stages: 1) learning image token embeddings and pixel dependencies, 2) aligning image-text for understanding and generation, and 3) fine-tuning on high-quality data.
Stats
Show-o requires approximately 20 times fewer sampling steps than autoregressive image generation, since its discrete diffusion decoding predicts many image tokens in parallel at each step.
Show-o is built upon a pre-trained large language model (LLM) and inherits the autoregressive modeling capability for text-based reasoning.
Quotes
"can one single transformer handle both multimodal understanding and generation?"
"can such one single transformer involve both autoregressive and diffusion modeling?"