Harnessing Multimodal LLMs for Compositional Text-to-Image Generation and Editing
This paper introduces Recaption, Plan and Generate (RPG), a training-free framework that leverages the reasoning abilities of multimodal large language models (MLLMs) to improve the compositionality and controllability of text-to-image diffusion models.
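To make the three-stage flow concrete, here is a minimal, illustrative sketch of a recaption-plan-generate pipeline. All names in it (Subregion, recaption_and_plan, generate_from_plan) are hypothetical placeholders for this summary, not the paper's actual API; the MLLM call and the diffusion backend are stubbed out so the snippet runs standalone.

```python
# Hypothetical sketch of the RPG-style pipeline described above.
# The helpers below are placeholders, not the authors' implementation.

from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Subregion:
    bbox: Tuple[float, float, float, float]  # normalized (x0, y0, x1, y1)
    subprompt: str                           # recaptioned description for this region


def recaption_and_plan(prompt: str) -> List[Subregion]:
    """Stages 1-2: an MLLM recaptions the prompt into detailed sub-prompts
    and plans a spatial layout for them (stubbed here with a fixed plan)."""
    # In practice this would be a reasoning query to an MLLM; we return a
    # fixed two-region plan so the sketch is self-contained and runnable.
    return [
        Subregion((0.0, 0.0, 0.5, 1.0), "a green apple on the left, glossy skin"),
        Subregion((0.5, 0.0, 1.0, 1.0), "a red ceramic mug on the right, steam rising"),
    ]


def generate_from_plan(prompt: str, plan: List[Subregion]) -> None:
    """Stage 3: region-wise diffusion guided by the plan (placeholder).
    A real implementation would denoise each region with its sub-prompt
    and fuse the results with a diffusion pipeline."""
    for region in plan:
        print(f"generate {region.subprompt!r} in {region.bbox}")


if __name__ == "__main__":
    user_prompt = "a green apple and a red mug on a table"
    layout = recaption_and_plan(user_prompt)
    generate_from_plan(user_prompt, layout)
```

The point of the sketch is only the division of labor: the MLLM handles recaptioning and layout planning, while the diffusion model is responsible for rendering each planned region, with no additional training of either component.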