Bibliographic Information: Bai, J., Ye, T., Chow, W., Song, E., Chen, Q.-G., Li, X., Dong, Z., Zhu, L., & Yan, S. (2024). Meissonic: Revitalizing Masked Generative Transformers for Efficient High-Resolution Text-to-Image Synthesis [Technical Report]. arXiv:2410.08261v1 [cs.CV].
Research Objective: This paper introduces Meissonic, a text-to-image synthesis model that aims to overcome the limitations of existing masked image modeling (MIM) approaches, particularly in generating high-resolution images and achieving performance comparable to that of leading diffusion models.
Methodology: Meissonic leverages a multi-modal transformer architecture with several key innovations: a combination of multi-modal and single-modal transformer layers, advanced positional encoding using Rotary Position Embeddings (RoPE), and an adaptive masking rate as a sampling condition. The model is trained progressively through four stages, each focusing on specific aspects of image synthesis quality, and incorporates micro-conditions like original image resolution, crop coordinates, and human preference scores. Feature compression layers are integrated to enable efficient high-resolution generation.
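Illustration: To make the positional-encoding component above more concrete, the following is a minimal pure-Python sketch of Rotary Position Embeddings in their standard 1D form. It is not taken from the Meissonic code base; the function names `rope_rotate` and `dot`, the base of 10000, and the toy vectors are assumptions chosen for illustration, and this summary does not state which RoPE variant Meissonic uses. The sketch shows the property that makes RoPE attractive for attention: the score between a rotated query and a rotated key depends only on their relative offset.

```python
import math


def rope_rotate(vec, position, base=10000.0):
    """Apply a standard 1D rotary position embedding to one even-length vector.

    Hypothetical helper for illustration; not the Meissonic implementation.
    """
    dim = len(vec)
    if dim % 2 != 0:
        raise ValueError("vector length must be even")
    out = []
    for i in range(0, dim, 2):
        # Each consecutive pair (x, y) is rotated by an angle that grows with
        # the token position and shrinks as the pair index increases.
        theta = position * base ** (-i / dim)
        c, s = math.cos(theta), math.sin(theta)
        x, y = vec[i], vec[i + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out


def dot(a, b):
    """Plain dot product, standing in for an attention score."""
    return sum(x * y for x, y in zip(a, b))


# Toy query and key vectors (assumed values).
query = [0.5, -1.0, 2.0, 0.25]
key = [1.5, 0.3, -0.7, 1.0]

# The same relative offset (3) at two different absolute positions.
score_near = dot(rope_rotate(query, 5), rope_rotate(key, 2))
score_far = dot(rope_rotate(query, 105), rope_rotate(key, 102))
```

Under these assumptions, `score_near` and `score_far` agree up to floating-point error, because rotating both vectors by the same additional angle leaves their dot product unchanged.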
Key Findings: Meissonic generates high-resolution (1024×1024) images efficiently, even on consumer-grade GPUs with limited VRAM. It outperforms existing MIM methods and achieves results comparable or superior to those of state-of-the-art diffusion models such as SDXL in image quality, detail, and text-image alignment.
Main Conclusions: Meissonic presents a significant advancement in MIM-based text-to-image synthesis, offering a viable and efficient alternative to diffusion models. Its ability to generate high-quality, high-resolution images on readily available hardware makes it a valuable tool for various applications.
Significance: This research pushes the boundaries of MIM methods in text-to-image synthesis, demonstrating their potential to compete with and even surpass diffusion models in performance while maintaining efficiency.
Limitations and Future Research: While Meissonic excels in high-resolution image generation, the authors acknowledge that further research is needed to explore its capabilities in generating text within images, a feature currently limited by the choice of text encoder.
Key insights distilled from
by Jinbin Bai, ... at arxiv.org 10-14-2024
https://arxiv.org/pdf/2410.08261.pdf
Deeper Inquiries