The article introduces the concept of referring object removal using natural language expressions. It presents the ComCOCO dataset with 136,495 referring expressions for 34,615 objects in 23,951 image pairs. The proposed Syntax-Aware Hybrid Mapping Network combines linguistic and visual features to remove specific objects guided by natural language instructions. Extensive experiments show significant performance improvement over diffusion models and two-stage methods. The study emphasizes the importance of structured datasets and end-to-end models for complex image editing tasks.
Egy másik nyelvre
a forrásanyagból
arxiv.org
Mélyebb kérdések