The paper introduces referring object removal: removing a specific object from an image, where the object is identified by a natural language expression. It presents the ComCOCO dataset, which contains 136,495 referring expressions for 34,615 objects in 23,951 image pairs. The proposed Syntax-Aware Hybrid Mapping Network combines linguistic and visual features to remove the object that the expression refers to. In the reported experiments, it significantly outperforms diffusion models and two-stage methods. The study emphasizes the importance of structured datasets and end-to-end models for complex image editing tasks.
Key Insights Distilled From
by Xiangtian Xu... at arxiv.org 03-15-2024
https://arxiv.org/pdf/2403.09128.pdf