The article introduces referring object removal, the task of removing a specific object from an image based on a natural language expression that refers to it. It presents the ComCOCO dataset, which contains 136,495 referring expressions for 34,615 objects in 23,951 image pairs. The proposed Syntax-Aware Hybrid Mapping Network combines linguistic and visual features to remove the object named in the instruction. Extensive experiments show that it significantly outperforms diffusion models and two-stage methods. The study emphasizes the importance of structured datasets and end-to-end models for complex image editing tasks.
Key insights drawn from the paper by Xiangtian Xu... at arxiv.org, 03-15-2024
https://arxiv.org/pdf/2403.09128.pdf