The paper presents Localization-aware Inversion (LocInv), a method for improving text-guided image editing using diffusion models. The key insights are:
Existing text-guided image editing methods suffer from cross-attention leakage, where the cross-attention maps do not accurately align with the corresponding objects in the image. This leads to unintended changes during editing.
LocInv addresses this issue by incorporating localization priors, such as segmentation maps or detection bounding boxes, to guide the inversion process and refine the cross-attention maps.
LocInv updates the token representations associated with objects at each timestep, using a similarity loss and an overlapping loss to ensure the cross-attention maps closely align with the localization priors.
Additionally, LocInv introduces an adjective binding loss to reinforce the connection between adjective and noun words, enabling better attribute editing.
Experiments on the COCO-edit dataset show that LocInv outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex multi-object scenes and attribute editing tasks.
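The similarity, overlapping, and adjective binding losses described above can be sketched on toy attention maps. This summary does not give the exact formulas, so the forms below are assumptions: cosine distance between an attention map and the prior, the share of attention mass falling outside the prior, and cosine distance between an adjective's map and its noun's map. All function names are hypothetical, and the maps are plain 2D lists rather than tensors from a real diffusion model.

```python
import math

def _flatten(m):
    # Turn a 2D list (H x W map) into a flat list of floats.
    return [float(v) for row in m for v in row]

def cosine_similarity(a, b, eps=1e-8):
    # Cosine similarity between two maps of the same shape.
    fa, fb = _flatten(a), _flatten(b)
    dot = sum(x * y for x, y in zip(fa, fb))
    norm_a = math.sqrt(sum(x * x for x in fa))
    norm_b = math.sqrt(sum(y * y for y in fb))
    return dot / (norm_a * norm_b + eps)

def similarity_loss(attn, prior):
    # Assumed form: 1 - cos(attention map, localization prior).
    return 1.0 - cosine_similarity(attn, prior)

def overlapping_loss(attn, prior, eps=1e-8):
    # Assumed form: fraction of attention mass lying outside the prior mask.
    fa, fp = _flatten(attn), _flatten(prior)
    inside = sum(x * y for x, y in zip(fa, fp))
    return 1.0 - inside / (sum(fa) + eps)

def adjective_binding_loss(attn_adj, attn_noun):
    # Assumed form: 1 - cos(adjective map, noun map), pulling the two together.
    return 1.0 - cosine_similarity(attn_adj, attn_noun)

# Toy 4x4 example: the prior is a binary mask over the central 2x2 block.
prior = [[0, 0, 0, 0],
         [0, 1, 1, 0],
         [0, 1, 1, 0],
         [0, 0, 0, 0]]
# An attention map concentrated on the object region.
aligned = [[0.9 * v for v in row] for row in prior]
# A "leaky" attention map spread uniformly over the whole image.
leaky = [[0.25] * 4 for _ in range(4)]

for name, attn in (("aligned", aligned), ("leaky", leaky)):
    print(name, similarity_loss(attn, prior), overlapping_loss(attn, prior))
```

Under these assumed definitions, the aligned map scores close to zero on both losses, while the uniform map scores about 0.5 on the similarity loss and 0.75 on the overlapping loss. In a real pipeline, losses of this kind would be minimized with respect to the token representations at each timestep, which is the update the summary describes.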
Key insights distilled from:
by Chuanming Ta... on arxiv.org, 05-03-2024
https://arxiv.org/pdf/2405.01496.pdf