Core Concepts
The method performs mask-free local image editing: it learns to generate bounding boxes that align with the provided text description, so users never have to specify masks or regions by hand.
Abstract
The paper introduces a method for text-driven image editing that can generate realistic and relevant edited images without the need for user-specified regions or masks. The key components are:
Region Generation Network: This network learns to propose bounding boxes around the regions of the input image that are relevant to the provided text description, using CLIP guidance to identify which regions should be edited (see the first sketch after this list).
Compatibility with Existing Models: The proposed region generation component can be integrated with different image synthesis models, including non-autoregressive transformers like MaskGIT and diffusion models like Stable Diffusion (see the second sketch below).
Evaluation: Extensive user studies show that the proposed method outperforms state-of-the-art text-driven image editing approaches at producing edits that are faithful to the text prompt while preserving the surrounding image context.
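To make the CLIP guidance concrete, the following minimal Python sketch scores candidate boxes by CLIP image-text similarity and keeps the best one. The paper instead trains a region generation network with a CLIP-based objective; the brute-force sliding-window grid, the openai/clip-vit-base-patch32 checkpoint, and the candidate_boxes/best_region helpers are illustrative assumptions, not the paper's implementation.

```python
# A minimal sketch of CLIP-guided region selection, NOT the paper's trained
# network: every sliding-window box is cropped, each crop is scored by CLIP
# similarity against the edit prompt, and the best-scoring box is returned.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def candidate_boxes(width, height, scales=(0.25, 0.5, 0.75), stride=0.25):
    """Sliding-window boxes at several scales (a stand-in for learned anchors)."""
    boxes = []
    for s in scales:
        bw, bh = int(width * s), int(height * s)
        step_x, step_y = max(1, int(width * stride)), max(1, int(height * stride))
        for x in range(0, width - bw + 1, step_x):
            for y in range(0, height - bh + 1, step_y):
                boxes.append((x, y, x + bw, y + bh))
    return boxes

@torch.no_grad()
def best_region(image, prompt):
    """Return the box whose crop CLIP scores as most similar to the prompt."""
    boxes = candidate_boxes(*image.size)
    crops = [image.crop(box) for box in boxes]
    inputs = processor(text=[prompt], images=crops, return_tensors="pt", padding=True)
    scores = model(**inputs).logits_per_image.squeeze(-1)  # one score per crop
    return boxes[int(scores.argmax())]

# Usage (hypothetical file name):
# box = best_region(Image.open("input.jpg"), "a red sports car")
```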
The paper demonstrates the flexibility and effectiveness of the learnable regions approach for text-guided image editing, without requiring manual mask specification.
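As one plausible way to realize the compatibility claim with a diffusion model, the hedged sketch below rasterizes a generated box into a binary mask and hands it to a Stable Diffusion inpainting pipeline from the diffusers library, which regenerates only the masked region from the text prompt. The checkpoint name, the fixed 512x512 working resolution, and the edit_region helper are assumptions for illustration; the paper's actual integrations (e.g., with MaskGIT) differ in their details.

```python
# A hedged sketch of plugging a generated region into a diffusion editor:
# the box becomes a binary mask, and Stable Diffusion inpainting regenerates
# only the masked pixels from the text prompt. Checkpoint and 512x512
# working size are assumptions, not choices taken from the paper.
import torch
from PIL import Image, ImageDraw
from diffusers import StableDiffusionInpaintPipeline

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "stabilityai/stable-diffusion-2-inpainting", torch_dtype=torch.float16
).to("cuda")

def edit_region(image, box, prompt):
    """Regenerate the pixels inside `box` so they match `prompt`."""
    mask = Image.new("L", image.size, 0)           # black = keep original pixels
    ImageDraw.Draw(mask).rectangle(box, fill=255)  # white = region to regenerate
    image = image.resize((512, 512))
    mask = mask.resize((512, 512), Image.NEAREST)  # keep the mask binary
    return pipe(prompt=prompt, image=image, mask_image=mask).images[0]

# Usage (hypothetical):
# edited = edit_region(Image.open("input.jpg"), box, "a red sports car")
```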
Stats
The paper's key results rest on qualitative comparisons and user studies rather than specific numerical statistics.
Quotes
The paper contains no striking quotes that directly support its key arguments.