The paper proposes a new text-to-image generation framework called Recaption, Plan and Generate (RPG) that utilizes multimodal large language models (MLLMs) to improve the compositional capabilities of diffusion models.
The key strategies of RPG are:
Multimodal Recaptioning: RPG uses MLLMs to transform complex text prompts into highly descriptive ones, enriching the model's prompt comprehension and improving text-image semantic alignment.
Chain-of-Thought Planning: RPG partitions the image space into complementary subregions and assigns a different subprompt to each, breaking the compositional generation task into simpler subtasks. It harnesses the chain-of-thought reasoning of MLLMs to plan the region division efficiently; a sketch of this recaption-and-plan step appears after this list.
Complementary Regional Diffusion: Based on the planned subregions and their subprompts, RPG introduces a novel complementary regional diffusion approach that denoises each subregion under its own subprompt and composes the results, improving the flexibility and precision of compositional text-to-image generation (a toy sketch of the composition step also follows).
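The recaption-and-plan step can be sketched as a single call to an instruction-following MLLM. The sketch below assumes an OpenAI-style chat API as a stand-in for the MLLM; the model name, the instruction template, and the JSON region format are illustrative assumptions, not the paper's exact prompts.

```python
# Minimal sketch of the recaption-and-plan step. The OpenAI client,
# model choice, and prompt template are illustrative stand-ins for
# whatever MLLM is used as the planner.
import json
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

PLANNER_INSTRUCTIONS = """\
You are a layout planner for text-to-image generation.
1. Recaption: rewrite the user's prompt as a highly descriptive caption.
2. Plan: split the image into non-overlapping subregions and assign each
   one a subprompt. Reason step by step, then answer with JSON of the form
   {"recaption": str, "regions": [{"box": [x0, y0, x1, y1], "subprompt": str}]}
   where box coordinates are fractions of image width/height in [0, 1].
"""

def recaption_and_plan(prompt: str) -> dict:
    """Ask the MLLM to recaption a complex prompt and plan subregions."""
    response = client.chat.completions.create(
        model="gpt-4o",  # any capable (M)LLM could be substituted here
        messages=[
            {"role": "system", "content": PLANNER_INSTRUCTIONS},
            {"role": "user", "content": prompt},
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(response.choices[0].message.content)

plan = recaption_and_plan(
    "A green-haired twintail girl in an orange dress next to a "
    "red-haired girl in a white dress, in a garden"
)
for region in plan["regions"]:
    print(region["box"], "->", region["subprompt"])
```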
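The composition arithmetic at the heart of regional diffusion can be illustrated in plain PyTorch. In the toy sketch below, `predict_noise` is a placeholder for a real denoiser UNet call, and the base-prompt blending weight and half-and-half mask layout are assumptions for illustration, not the paper's exact formulation.

```python
# Toy sketch of one complementary regional diffusion step.
# `predict_noise` stands in for a real UNet noise prediction
# conditioned on a text embedding; only the mask-and-compose
# arithmetic is shown.
import torch

def predict_noise(latent, subprompt):
    # Placeholder: a real implementation would run the diffusion UNet
    # conditioned on the subprompt's text embedding.
    return torch.randn_like(latent)

def regional_step(latent, regions, base_prompt, base_weight=0.3):
    """Compose per-region noise predictions into one latent update.

    regions: list of (mask, subprompt), where mask is 1 inside the
    subregion and 0 elsewhere; masks tile the latent without overlap.
    """
    eps_base = predict_noise(latent, base_prompt)  # global coherence pass
    eps_regional = torch.zeros_like(latent)
    for mask, subprompt in regions:
        eps_regional += mask * predict_noise(latent, subprompt)
    # Blend regional detail with the base pass (weight is illustrative).
    return base_weight * eps_base + (1 - base_weight) * eps_regional

# Example: a 1x4x64x64 latent split into left and right halves.
latent = torch.randn(1, 4, 64, 64)
left = torch.zeros(1, 1, 64, 64)
left[..., :32] = 1.0
right = 1.0 - left
eps = regional_step(
    latent,
    [(left, "a green-haired girl in an orange dress"),
     (right, "a red-haired girl in a white dress")],
    base_prompt="two girls standing in a garden",
)
print(eps.shape)  # torch.Size([1, 4, 64, 64])
```

In a full sampler this composed prediction would feed the usual denoising update at every timestep, so each subregion is steered by its own subprompt while the base pass keeps the overall image coherent.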
RPG also extends to text-guided image editing by combining recaptioning and planning with contour-based regional diffusion for editing. Extensive experiments demonstrate that RPG outperforms state-of-the-art text-to-image diffusion models, particularly in multi-category object composition and text-image semantic alignment.