The paper presents Ranni, a new approach to text-to-image generation that aims to improve how accurately complex instructions are followed. Ranni introduces a semantic panel as middleware between text and image, bridging the two modalities.
The text-to-image generation process in Ranni is divided into two sub-tasks: text-to-panel and panel-to-image. In the text-to-panel stage, large language models (LLMs) are used to parse the input text and generate a semantic panel that represents the visual concepts, including their bounding boxes, colors, keypoints, and textual descriptions. This semantic panel is then used as a control signal to guide the diffusion-based image generation in the panel-to-image stage.
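To make the two-stage interface concrete, the following is a minimal sketch of what a semantic panel and the pipeline around it might look like. This is not Ranni's actual API: the class `PanelObject` and the functions `text_to_panel` and `panel_to_image` are hypothetical names, and both stages are stubbed out rather than backed by a real LLM or diffusion model.

```python
from dataclasses import dataclass, field

@dataclass
class PanelObject:
    """One visual concept in the semantic panel (all names here are illustrative)."""
    description: str                             # textual description, e.g. "a red apple"
    bbox: tuple[float, float, float, float]      # normalized (x0, y0, x1, y1) box
    color: str = "unspecified"                   # dominant color attribute
    keypoints: list[tuple[float, float]] = field(default_factory=list)

def text_to_panel(prompt: str) -> list[PanelObject]:
    """Stage 1 stub: in Ranni, an LLM parses the prompt into visual concepts."""
    # Hard-coded output illustrating what the LLM might return for this prompt.
    return [
        PanelObject("a red apple", bbox=(0.10, 0.40, 0.40, 0.70), color="red"),
        PanelObject("a red apple", bbox=(0.55, 0.40, 0.85, 0.70), color="red"),
    ]

def panel_to_image(panel: list[PanelObject]):
    """Stage 2 stub: a diffusion model conditioned on the panel would go here."""
    return None  # placeholder; no actual generation in this sketch

panel = text_to_panel("two red apples on a wooden table")
image = panel_to_image(panel)
```

The point of the intermediate structure is that each object carries its attributes explicitly, so counts and attribute bindings are fixed before any pixels are generated.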
The introduction of the semantic panel allows Ranni to better follow complex instructions, such as those involving quantity, object-attribute binding, and multi-subject descriptions, which are challenging for existing text-to-image models. Ranni also enables intuitive image editing by allowing users to directly manipulate the semantic panel, either manually or with the help of LLMs.
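Under the same hypothetical interface, editing then amounts to mutating the panel and re-running only the second stage; the panel is shown here as plain dicts to keep the sketch self-contained.

```python
# Editing sketch, continuing the hypothetical interface above: the user (or an
# LLM) mutates the panel, then only the panel-to-image stage is re-run.
panel = [
    {"description": "a red apple", "bbox": (0.10, 0.40, 0.40, 0.70), "color": "red"},
    {"description": "a red apple", "bbox": (0.55, 0.40, 0.85, 0.70), "color": "red"},
]

# Example edits: recolor one apple and drag it upward by shifting its box.
panel[1]["description"] = "a green apple"
panel[1]["color"] = "green"
panel[1]["bbox"] = (0.55, 0.10, 0.85, 0.40)

# edited_image = panel_to_image(panel)  # hypothetical stage-2 call, as above
```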
The paper also presents an automatic data-preparation pipeline that constructs a large dataset of image-text-panel triples, enabling efficient training of the Ranni framework. Experiments show that Ranni outperforms existing text-to-image models on various alignment tasks and demonstrates the potential of a fully automatic, chat-based image-creation system.