
Dual-Modal Prompting for Effective Fine-Grained Zero-Shot Sketch-Based Image Retrieval


Core Concepts
A dual-modal prompting strategy that leverages category-specific visual and textual insights to enable flexible adaptation of the retrieval model to unseen target categories, thereby achieving improved fine-grained zero-shot sketch-based image retrieval performance.
Summary
This paper addresses the fine-grained zero-shot sketch-based image retrieval (ZS-SBIR) task, which aims to match a given hand-drawn sketch with the exact corresponding realistic image instance within the target category. The key insight is that existing generalization-based approaches, which derive knowledge from seen categories and directly transfer it to unseen categories, are sub-optimal for this task, because the knowledge accumulated for distinguishing instances within seen categories may not be fully transferable or applicable to unseen categories.

To address this, the authors propose a dual-modal prompting CLIP (DP-CLIP) model. DP-CLIP leverages a visual prompting module and a textual prompting module to provide the retrieval model with category-centric insights, enabling it to adapt effectively to the target categories. The visual prompting module utilizes a few support images from the target category to generate category-specific visual prompts, which are injected into the CLIP visual encoder to guide its adaptation to the target category. The textual prompting module employs the textual category label to produce category-specific channel scaling vectors, which are applied to the CLIP visual encoder to direct it to focus on channels relevant to the target category. Additionally, a customized patch-level matching module is designed to capture detailed local correspondences between sketches and photos, further improving fine-grained retrieval performance.

Extensive experiments on the Sketchy dataset demonstrate that DP-CLIP outperforms the state-of-the-art fine-grained ZS-SBIR method by a significant margin of 7.3% in Acc.@1. The authors also evaluate DP-CLIP on category-level ZS-SBIR benchmarks, where it achieves promising results.
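To make the described architecture concrete, below is a minimal PyTorch sketch of the dual-modal prompting idea. It is not the authors' implementation: the module names, prompt count, feature dimensions, injection point, and the toy patch-level matching score are all illustrative assumptions based on the summary above.

```python
# Illustrative sketch only -- module names, dimensions, and injection points are
# assumptions inferred from the summary, not the authors' released code.
import torch
import torch.nn as nn


class VisualPromptGenerator(nn.Module):
    """Turns features of a few support images from the target category into
    prompt tokens prepended to the patch tokens of the CLIP visual encoder."""

    def __init__(self, feat_dim=768, num_prompts=4):
        super().__init__()
        self.num_prompts = num_prompts
        self.proj = nn.Linear(feat_dim, num_prompts * feat_dim)

    def forward(self, support_feats):                 # (num_support, feat_dim)
        category_feat = support_feats.mean(dim=0)     # pool the support set
        prompts = self.proj(category_feat)            # (num_prompts * feat_dim,)
        return prompts.view(self.num_prompts, -1)     # (num_prompts, feat_dim)


class TextualChannelScaler(nn.Module):
    """Maps the CLIP text embedding of the category label to a per-channel
    scaling vector that re-weights the visual features."""

    def __init__(self, text_dim=512, feat_dim=768):
        super().__init__()
        self.mlp = nn.Sequential(nn.Linear(text_dim, feat_dim), nn.Sigmoid())

    def forward(self, label_embedding):               # (text_dim,)
        return self.mlp(label_embedding)              # (feat_dim,)


def inject_prompts(patch_tokens, visual_prompts, channel_scale):
    """Apply channel scaling and prepend category-specific prompt tokens before
    a transformer block of the visual encoder (one plausible injection point)."""
    scaled = patch_tokens * channel_scale                                  # (B, N, D)
    prompts = visual_prompts.unsqueeze(0).expand(patch_tokens.size(0), -1, -1)
    return torch.cat([prompts, scaled], dim=1)                             # (B, P+N, D)


def patch_level_score(sketch_tokens, photo_tokens):
    """Toy patch-level matching: each sketch patch is matched to its most
    similar photo patch and the scores are averaged."""
    sketch_tokens = nn.functional.normalize(sketch_tokens, dim=-1)         # (N_s, D)
    photo_tokens = nn.functional.normalize(photo_tokens, dim=-1)           # (N_p, D)
    sim = sketch_tokens @ photo_tokens.t()                                 # (N_s, N_p)
    return sim.max(dim=1).values.mean()
```

In this reading, a sketch or photo would be encoded by first deriving the channel scaling vector from the category label and the visual prompts from the support photos, then passing the prompted token sequence through the (otherwise frozen) CLIP visual encoder before computing the patch-level score.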
Statistics
The Sketchy dataset contains 75,471 sketches and 73,002 photos across 125 categories. The Sketchy Ext dataset contains 73,002 photos across 104 training and 24 test categories. The TU-Berlin Ext dataset contains 250 categories, with 80 sketches and 820 photos per category on average.
Quotes
"Our key insight is that the generalization learning approaches in previous ZS-SBIR research, which derive knowledge that is effective for handling seen categories from training set [18] or pre-trained models [22] and then directly transfer it to target unseen categories, are not apt for this fine-grained and zero-shot recognition scenario." "To achieve this goal, we introduce an adaptive prompting strategy tailored for the fine-grained ZS-SBIR task."

Key insights extracted from

by Liying Gao, B... at arxiv.org, 04-30-2024

https://arxiv.org/pdf/2404.18695.pdf
Dual-Modal Prompting for Sketch-Based Image Retrieval

Deeper Inquiries

How can the proposed dual-modal prompting strategy be extended to other cross-modal retrieval tasks beyond sketch-based image retrieval?

The proposed dual-modal prompting strategy in the DP-CLIP model can be extended to other cross-modal retrieval tasks by adapting the concept of category-centric insights and adaptive prompting to different modalities. For tasks beyond sketch-based image retrieval, such as text-to-image retrieval or image-to-audio retrieval, the model can utilize textual prompts and visual prompts tailored to each target category. By incorporating category-specific guidance from both modalities, the model can dynamically adapt to unseen categories and capture unique discriminative clues for effective retrieval. This approach can be applied to various cross-modal retrieval tasks to improve generalization and performance in zero-shot scenarios.

What are the potential limitations of the current visual and textual prompting modules, and how could they be further improved to enhance the model's adaptability to unseen categories?

The current visual and textual prompting modules in DP-CLIP have limitations that could be addressed to enhance the model's adaptability to unseen categories. One limitation is the reliance on a fixed set of support images for generating visual prompts, which may not capture the full diversity of target categories. To address this, the model could benefit from a more diverse and representative set of support images for each target category. Additionally, the textual prompting module could be enhanced by incorporating more sophisticated text encoding techniques to extract richer category-specific information from the textual category labels. Furthermore, exploring advanced techniques for prompt tuning and adaptive feature calibration could improve the effectiveness of both modules in guiding the model's adaptation to novel categories.

Given the promising results on both fine-grained and category-level ZS-SBIR benchmarks, how could the DP-CLIP model be leveraged to address other challenging zero-shot visual recognition tasks, such as few-shot learning or open-set recognition?

Given the promising results on fine-grained and category-level ZS-SBIR benchmarks, the DP-CLIP model can be leveraged to address other challenging zero-shot visual recognition tasks, such as few-shot learning or open-set recognition. For few-shot learning tasks, the model can be adapted to learn from a limited number of examples per category by incorporating few-shot learning techniques, such as meta-learning or episodic training. This would enable the model to generalize to new categories with minimal training data. In the context of open-set recognition, the DP-CLIP model could be extended to handle unseen classes during inference by incorporating outlier detection mechanisms or uncertainty estimation techniques. By leveraging the category-centric insights and adaptive prompting strategy of the DP-CLIP model, it can be applied to a wide range of zero-shot visual recognition tasks to improve performance and adaptability in challenging scenarios.
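As one hedged illustration of the few-shot direction mentioned above, the sketch below shows how episodic training could be wired around a prompting model. The helpers `model.build_prompts` and `model.encode` are hypothetical placeholders, and the sampling scheme is an assumption rather than anything described in the paper.

```python
# Hypothetical episodic training loop; `build_prompts` / `encode` are placeholder
# methods, not APIs from the paper or from CLIP.
import random
import torch


def sample_episode(pairs_by_category, num_support=5, num_query=16):
    """Pick one category, a small support set of photos for prompting,
    and a disjoint query set of (sketch, photo) pairs for the retrieval loss."""
    category = random.choice(list(pairs_by_category.keys()))
    pairs = list(pairs_by_category[category])
    random.shuffle(pairs)
    return category, pairs[:num_support], pairs[num_support:num_support + num_query]


def train_episode(model, optimizer, pairs_by_category, loss_fn):
    category, support, query = sample_episode(pairs_by_category)
    support_photos = torch.stack([photo for _, photo in support])
    prompts = model.build_prompts(support_photos, category)    # hypothetical helper
    sketches = torch.stack([sketch for sketch, _ in query])
    photos = torch.stack([photo for _, photo in query])
    loss = loss_fn(model.encode(sketches, prompts),             # hypothetical helper
                   model.encode(photos, prompts))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```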