Main Idea
The paper presents practical guidelines for designing high-performance fine-grained image retrieval models and proposes a novel Dual Visual Filtering mechanism together with a Discriminative Model Training strategy to capture subcategory-specific discrepancies and strengthen the model's discriminability and generalization ability.
Abstract
The paper focuses on the task of fine-grained image retrieval (FGIR), which aims to retrieve images with the same subcategory as the query image from a database within the same metacategory (e.g., birds, cars). The authors identify three key challenges in FGIR:
Small-sized objects in the input image make it difficult to identify discriminative regions.
Significant intra-class variations and subtle inter-class differences require highlighting subcategory-specific discrepancies.
Limited fine-grained image data jeopardizes the model's discriminative capacity and generalization ability.
To address these challenges, the authors propose the following guidelines for designing high-performance FGIR models:
G1: Emphasize the object by utilizing object-emphasized images as input.
G2: Highlight subcategory-specific discrepancies to improve the model's discriminative capability.
G3: Employ effective training strategies to alleviate the scarcity of fine-grained image data.
Following these guidelines, the authors develop a Dual Visual Filtering (DVF) mechanism, which consists of:
Object-oriented Visual Filtering (OVF) module: Utilizes a visual foundation model to zoom in on the object in the input image, helping to capture more discriminative details.
Semantic-oriented Visual Filtering (SVF) module: Calculates token-level importance to enhance the selection of discriminative tokens and eliminate noisy features, improving the model's ability to discern subtle discrepancies between subcategories.
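The token-level filtering idea behind SVF can be illustrated with a minimal sketch: score each ViT patch token by an importance signal (here, class-token attention, a common choice) and keep only the top-scoring fraction. The function name, the use of class-token attention as the score, and the `keep_ratio` parameter are illustrative assumptions, not the paper's exact formulation.

```python
import numpy as np

def select_discriminative_tokens(tokens, cls_attn, keep_ratio=0.5):
    """Keep the most important patch tokens and drop likely-noisy ones.

    tokens:     (N, D) patch token embeddings from a ViT-style encoder
    cls_attn:   (N,) attention weights from the class token to each patch
    keep_ratio: fraction of tokens to retain (hypothetical parameter)
    """
    k = max(1, int(len(tokens) * keep_ratio))
    # Rank patches by their attention-derived importance score.
    order = np.argsort(cls_attn)[::-1]
    keep = np.sort(order[:k])  # preserve spatial order of the survivors
    return tokens[keep], keep

# Toy usage: 8 tokens of dimension 4 with random importance scores.
rng = np.random.default_rng(0)
toks = rng.standard_normal((8, 4))
attn = rng.random(8)
kept, idx = select_discriminative_tokens(toks, attn, keep_ratio=0.5)
print(kept.shape)  # (4, 4)
```

Downstream layers then attend only over the retained tokens, which concentrates capacity on the regions most likely to carry subtle inter-subcategory cues.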
Additionally, the authors propose a Discriminative Model Training (DMT) strategy that combines data augmentation with a contrastive loss to enhance the model's discriminability and generalization ability.
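A common way to combine augmentation with a contrastive objective is an InfoNCE-style loss between two augmented views of each image, where matching views are positives and all other images in the batch are negatives. The sketch below assumes that standard formulation; the temperature value and the loss details are illustrative, not necessarily the paper's exact choices.

```python
import numpy as np

def contrastive_loss(emb_a, emb_b, temperature=0.1):
    """InfoNCE-style loss between two augmented views of a batch.

    emb_a, emb_b: (B, D) L2-normalized embeddings; row i of each view is
    a positive pair, every other row serves as a negative. The
    temperature is a hypothetical hyperparameter.
    """
    logits = emb_a @ emb_b.T / temperature        # (B, B) similarities
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positive pairs lie on the diagonal of the similarity matrix.
    return -np.mean(np.diag(log_prob))

# Toy usage: view b is a slightly perturbed (i.e. "augmented") copy of a.
rng = np.random.default_rng(1)
a = rng.standard_normal((4, 8))
a /= np.linalg.norm(a, axis=1, keepdims=True)
b = a + 0.05 * rng.standard_normal((4, 8))
b /= np.linalg.norm(b, axis=1, keepdims=True)
print(contrastive_loss(a, b))
```

Minimizing this loss pulls augmented views of the same image together while pushing apart other images, which encourages embeddings that stay stable under nuisance variation even with limited training data.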
Extensive experiments on three fine-grained image retrieval benchmarks (CUB-200-2011, Stanford Cars 196, and NABirds) demonstrate the superior performance of the proposed DVF model in both closed-set and open-set settings, outperforming state-of-the-art methods.
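Retrieval quality on benchmarks such as CUB-200-2011 is conventionally reported as Recall@K: the fraction of queries whose K nearest gallery neighbors contain at least one image of the same subcategory. A minimal sketch of that metric, assuming cosine similarity over L2-normalized embeddings (the summary does not state the paper's exact evaluation code):

```python
import numpy as np

def recall_at_k(embeddings, labels, k=1):
    """Fraction of queries whose k nearest neighbors (excluding the
    query itself) include at least one same-subcategory image."""
    sims = embeddings @ embeddings.T
    np.fill_diagonal(sims, -np.inf)  # a query never retrieves itself
    topk = np.argsort(-sims, axis=1)[:, :k]
    hits = (labels[topk] == labels[:, None]).any(axis=1)
    return hits.mean()

# Toy gallery: two well-separated subcategories, two images each.
emb = np.array([[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]])
emb /= np.linalg.norm(emb, axis=1, keepdims=True)
lab = np.array([0, 0, 1, 1])
print(recall_at_k(emb, lab, k=1))  # 1.0
```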
Statistics
The CUB-200-2011 dataset contains 11,788 bird images from 200 bird species.
The Stanford Cars 196 dataset consists of 16,185 images depicting 196 car variants.
The NABirds dataset contains 48,562 images showcasing North American birds across 555 subcategories.
Quotes
"The essence of FGIR tasks lies in learning discriminative and generalizable embeddings to identify visually similar objects."
"Subtle yet discriminative discrepancies are widely acknowledged as crucial for FGIR."
"Limited fine-grained image data inevitably limits retrieval performance."