Core Concepts
ClickDiffusion is an interactive system for precise image manipulation that combines natural language instructions with direct manipulation of visual elements.
Abstract
ClickDiffusion lets users perform precise image edits by combining natural language instructions with direct manipulation of visual elements in the image. The key innovations are:
ClickDiffusion integrates natural language instructions with direct manipulation of objects (e.g., selecting and moving objects with bounding boxes), so users can perform complex edits that are difficult to achieve with text-only approaches.
The system serializes the image layout and multi-modal instructions into a textual format, which can then be processed by large language models (LLMs) to generate the transformed image layout. This allows ClickDiffusion to leverage the few-shot generalization capabilities of LLMs to handle a wide range of possible transformations without the need for expensive training.
The ClickDiffusion user interface provides a small set of tools (select, bounding box, star, etc.) that let users specify visual elements and combine them with natural language instructions to perform precise image manipulations.
The authors demonstrate that ClickDiffusion outperforms text-only image editing approaches like InstructPix2Pix and LLM Grounded Diffusion, especially in scenarios that require disambiguating and manipulating specific objects within a complex scene.
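The serialization idea described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: the `Box` class, the one-line-per-object text format, the `[selected]` marker, and the function names are all assumptions made for this example, since the summary does not specify the actual textual format. No LLM is called here; the sketch only builds the few-shot prompt and parses a layout back from text.

```python
import re
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Box:
    """One object in the image layout (hypothetical representation)."""
    label: str              # e.g. "red ball"
    x: int                  # left edge in pixels
    y: int                  # top edge in pixels
    w: int                  # width in pixels
    h: int                  # height in pixels
    selected: bool = False  # True if the user marked this object in the UI


def serialize_layout(boxes: List[Box]) -> str:
    """Turn a layout into plain text, one object per line."""
    lines = []
    for i, b in enumerate(boxes):
        marker = " [selected]" if b.selected else ""
        lines.append(f"{i}: {b.label} ({b.x}, {b.y}, {b.w}, {b.h}){marker}")
    return "\n".join(lines)


# Matches lines produced by serialize_layout (assumed format).
_LINE = re.compile(
    r"(\d+): (.+?) \((-?\d+), (-?\d+), (-?\d+), (-?\d+)\)( \[selected\])?"
)


def parse_layout(text: str) -> List[Box]:
    """Parse text in the assumed format back into Box objects."""
    boxes = []
    for line in text.splitlines():
        m = _LINE.fullmatch(line.strip())
        if m:  # skip any line that is not in the expected format
            boxes.append(Box(
                label=m.group(2),
                x=int(m.group(3)), y=int(m.group(4)),
                w=int(m.group(5)), h=int(m.group(6)),
                selected=m.group(7) is not None,
            ))
    return boxes


def build_prompt(
    examples: List[Tuple[str, str, str]],  # (layout, instruction, new layout)
    boxes: List[Box],
    instruction: str,
) -> str:
    """Assemble a few-shot prompt: worked examples, then the new query."""
    parts = [
        f"Layout:\n{layout}\nInstruction: {instr}\nNew layout:\n{new_layout}"
        for layout, instr, new_layout in examples
    ]
    parts.append(
        f"Layout:\n{serialize_layout(boxes)}\n"
        f"Instruction: {instruction}\nNew layout:"
    )
    return "\n\n".join(parts)
```

Under these assumptions, the string returned by `build_prompt` would be sent to an LLM, and the text it returns could be read back with `parse_layout` to obtain the transformed layout.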
Example Instructions
"Move the red ball in the center to the left of the red ball on the right and make it black"
"Move the top left red apple and top green apple onto the plate"
Quotes
"By representing visual information like mouse interactions and bounding boxes in a textual form, we can cast the problem of image editing as one of text generation using LLMs."
"Leveraging the few-shot generalization capabilities of LLMs allows our system to generalize to a wide range of possible transformations without the need for expensive training."