The paper revisits the Google Landmarks v2 clean (GLDv2-clean) dataset, which is widely used for training state-of-the-art image retrieval models. The authors identify and remove landmark categories that overlap between the GLDv2-clean training set and the Revisited Oxford and Paris (ROxf and RPar) evaluation sets, creating a new version called RGLDv2-clean.
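Conceptually, the dedup step amounts to dropping every training class whose landmark also appears in the evaluation sets. Below is a minimal sketch of that filtering, assuming hypothetical inputs: the file names, the `landmark_id` column, and a precomputed set of overlapping IDs are illustrative, not the paper's actual release.

```python
# Sketch of the overlap-removal step. `train_df` holds GLDv2-clean
# (image id, landmark id) pairs; `overlap_ids` is an assumed,
# precomputed set of landmark categories that match ROxf/RPar.
import pandas as pd

def remove_overlap(train_df: pd.DataFrame, overlap_ids: set) -> pd.DataFrame:
    """Drop every training image whose landmark class also appears
    in the evaluation sets, yielding an RGLDv2-clean-style split."""
    mask = ~train_df["landmark_id"].isin(overlap_ids)
    return train_df[mask].reset_index(drop=True)

train_df = pd.read_csv("gldv2_clean_train.csv")  # columns: id, landmark_id (hypothetical)
overlap_ids = set(pd.read_csv("overlap_ids.csv")["landmark_id"])  # hypothetical file
clean_df = remove_overlap(train_df, overlap_ids)
print(f"kept {len(clean_df)} of {len(train_df)} images")
```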
The authors then reproduce several state-of-the-art image retrieval methods under the same backbone network and training settings, training each on either the original GLDv2-clean or the new RGLDv2-clean and evaluating on ROxf and RPar. Performance drops dramatically when training on RGLDv2-clean, even though only a small fraction of images and categories is removed, underscoring how critical it is to avoid class overlap between training and evaluation sets.
To address the challenge of detecting objects of interest and ignoring background clutter, the authors introduce CiDeR, a single-stage, end-to-end pipeline that uses an attention-based approach to localize objects and extract a global image representation. CiDeR does not require any location supervision, unlike previous detect-to-retrieve (D2R) methods that often involve complex two-stage training and indexing pipelines.
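As a rough illustration of attention-based pooling in this spirit, the sketch below weights backbone features by a learned spatial saliency map before global pooling, so objects of interest dominate the descriptor without any box supervision. This is an assumption-laden sketch, not CiDeR's actual architecture: the sigmoid attention head, the projection size, and all layer names are invented for illustration.

```python
# Illustrative attention-weighted pooling (not the paper's exact design):
# a 1x1 conv scores each spatial location, and the scores gate the
# feature map before pooling into a single global descriptor.
import torch
import torch.nn as nn
import torch.nn.functional as F

class AttentionPooling(nn.Module):
    def __init__(self, channels: int, dim: int = 512):
        super().__init__()
        self.attn = nn.Conv2d(channels, 1, kernel_size=1)  # per-location saliency score
        self.proj = nn.Linear(channels, dim)                # projection to descriptor space

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        # feats: (B, C, H, W) backbone feature map
        weights = torch.sigmoid(self.attn(feats))           # (B, 1, H, W) saliency map
        pooled = (feats * weights).sum(dim=(2, 3)) / weights.sum(dim=(2, 3)).clamp(min=1e-6)
        return F.normalize(self.proj(pooled), dim=-1)       # L2-normalized global descriptor

desc = AttentionPooling(channels=2048)(torch.randn(2, 2048, 16, 16))
print(desc.shape)  # torch.Size([2, 512])
```

The design point this illustrates is that weighting features before pooling suppresses background clutter while keeping the pipeline single-stage and trainable end to end from image-level labels alone.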
Experiments show that CiDeR outperforms previous state-of-the-art methods whether trained on the original GLDv2-clean or on the new RGLDv2-clean, demonstrating the effectiveness of the proposed approach.