The paper presents Localization-aware Inversion (LocInv), a method for improving text-guided image editing with diffusion models. The key points are:
Existing text-guided image editing methods suffer from cross-attention leakage, where the cross-attention maps do not accurately align with the corresponding objects in the image. This misalignment causes unintended changes to regions outside the edit target.
LocInv addresses this issue by incorporating localization priors, such as segmentation maps or detection bounding boxes, to guide the inversion process and refine the cross-attention maps.
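To be usable as a prior, both kinds of localization input must be brought to the spatial resolution of the cross-attention maps. The paper's summary does not spell out this preprocessing, so the sketch below is our own minimal version: the helper names `box_to_mask` and `seg_to_mask` are hypothetical, and 16x16 is assumed only because it is a typical resolution at which Stable Diffusion cross-attention is analyzed.

```python
import torch
import torch.nn.functional as F

def box_to_mask(box_xyxy, h=16, w=16):
    """Rasterize a normalized detection box (x1, y1, x2, y2) into a
    binary mask at the cross-attention resolution, so it can be
    compared directly against an attention map. Hypothetical helper."""
    x1, y1, x2, y2 = box_xyxy
    mask = torch.zeros(h, w)
    mask[int(y1 * h):max(int(y2 * h), int(y1 * h) + 1),
         int(x1 * w):max(int(x2 * w), int(x1 * w) + 1)] = 1.0
    return mask

def seg_to_mask(seg, h=16, w=16):
    """Downsample a binary segmentation map of shape (H, W) to the
    attention resolution with area interpolation, then re-binarize."""
    seg = seg.float()[None, None]                      # (1, 1, H, W)
    small = F.interpolate(seg, (h, w), mode="area")
    return (small[0, 0] > 0.5).float()
```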
LocInv dynamically updates the token representations associated with each object at every denoising timestep, using a similarity loss and an overlapping loss to keep the cross-attention maps closely aligned with the localization priors.
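The exact loss formulations are not given in this summary, so the following is only a plausible sketch of the idea: a similarity term that pulls a token's attention map toward its prior mask, an overlap term that penalizes attention mass leaking outside the mask, and one gradient step on the token embedding per timestep. The hook `get_attn`, the learning rate, and the loss weight `lam` are all assumptions.

```python
import torch
import torch.nn.functional as F

def similarity_loss(attn_map, prior_mask):
    """Pull a token's cross-attention map toward its localization
    prior; sketched as one minus cosine similarity over flattened maps."""
    a = attn_map.flatten().unsqueeze(0)
    m = prior_mask.flatten().unsqueeze(0)
    return 1.0 - F.cosine_similarity(a, m).mean()

def overlap_loss(attn_map, prior_mask):
    """Penalize the fraction of attention mass that falls outside
    the prior region (a proxy for cross-attention leakage)."""
    outside = (attn_map * (1.0 - prior_mask)).sum()
    return outside / (attn_map.sum() + 1e-8)

def update_token(token_emb, get_attn, prior_mask, lr=0.01, lam=1.0):
    """One gradient step on a token embedding at the current timestep.
    `get_attn` is a placeholder for a hook that returns the token's
    cross-attention map given its embedding."""
    token_emb = token_emb.detach().requires_grad_(True)
    attn = get_attn(token_emb)
    loss = similarity_loss(attn, prior_mask) + lam * overlap_loss(attn, prior_mask)
    loss.backward()
    return (token_emb - lr * token_emb.grad).detach()
```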
Additionally, LocInv introduces an adjective binding loss that reinforces the connection between adjective words and their corresponding nouns, enabling better attribute editing.
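One natural way to realize such a binding term, continuing the previous sketch (and again only an assumption about the actual formulation), is to tie the adjective token's attention map to its noun's map while detaching the noun so only the adjective's representation is pulled:

```python
def adjective_binding_loss(adj_attn, noun_attn):
    """Bind an adjective token's attention map to its noun's map so
    attribute edits (e.g., 'red' in 'red car') stay on the object.
    Sketched as one minus cosine similarity; the noun map is
    detached so gradients flow only to the adjective token."""
    a = adj_attn.flatten().unsqueeze(0)
    n = noun_attn.detach().flatten().unsqueeze(0)
    return 1.0 - F.cosine_similarity(a, n).mean()
```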
Experiments on the COCO-edit dataset show that LocInv outperforms existing methods in both quantitative and qualitative evaluations, particularly in complex multi-object scenes and attribute editing tasks.