Core Concepts
Multi-modal prompts in pre-trained models act as dataset bias, enhancing recognition performance.
Summary
This paper investigates the mechanism behind multi-modal prompts in pre-trained vision-language models, using attention and alignment statistics to probe how prompts improve recognition performance. The study finds that prompts mainly function as a dataset bias that steers the model's adaptation toward the target dataset. Visualization experiments show how prompts reshape the attention distribution and the extracted features, and a novel bias-tuning method is proposed to validate the importance of this dataset bias for model performance.
Abstract
Prompt learning enhances recognition performance by acting as dataset bias.
Introduction
Pre-trained Vision-Language (VL) models learn from image-text pairs and transfer to a variety of downstream recognition tasks.
Preliminaries
The vision and text encoders process their inputs as token sequences through self-attention mechanisms.
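To ground this, here is a minimal single-head self-attention sketch in PyTorch; the function name, argument layout, and shapes are illustrative assumptions, not the paper's code.

```python
import torch
import torch.nn.functional as F

def self_attention(x: torch.Tensor,
                   w_q: torch.Tensor, w_k: torch.Tensor,
                   w_v: torch.Tensor) -> torch.Tensor:
    """x: (n, d) token sequence; w_q/w_k/w_v: (d, d) projection matrices."""
    q, k, v = x @ w_q, x @ w_k, x @ w_v
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # (n, n) scaled dot-product similarities
    weights = F.softmax(scores, dim=-1)      # each row sums to 1 over all tokens
    return weights @ v                       # attention-weighted mixture of values
```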
Exploring Experiments
The attention formulation analyzes how inserted prompt tokens enter the attention computation and reshape the attention weights.
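A hedged sketch of that effect, assuming prompts are injected as extra key/value tokens as in deep prompt tuning: each content token's softmax is then renormalized over content and prompt keys together, so the learned prompts can redistribute attention mass. `attention_with_prompts` and its signature are hypothetical.

```python
import torch
import torch.nn.functional as F

def attention_with_prompts(x: torch.Tensor, prompts: torch.Tensor,
                           w_q: torch.Tensor, w_k: torch.Tensor,
                           w_v: torch.Tensor) -> torch.Tensor:
    """x: (n, d) content tokens; prompts: (p, d) learnable prompt tokens."""
    xp = torch.cat([prompts, x], dim=0)      # (p + n, d): prompts join the sequence
    q = x @ w_q                              # queries from the content tokens only
    k, v = xp @ w_k, xp @ w_v                # keys/values now include the prompts
    scores = q @ k.T / (k.shape[-1] ** 0.5)  # (n, p + n)
    weights = F.softmax(scores, dim=-1)      # prompt columns draw attention mass
    return weights @ v                       # prompts shift every token's output
```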
What do the multi-modal prompts learn?
Textual and vision prompts reshape the attention distribution and, through it, the extracted features.
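One attention statistic of the kind such visualization experiments can report is the average fraction of attention mass that content tokens place on the prompt tokens; the helper below is a hypothetical diagnostic over the weight matrix from the previous sketch.

```python
import torch

def prompt_attention_mass(weights: torch.Tensor, num_prompts: int) -> float:
    """weights: (n, p + n) attention matrix with the p prompt columns first.
    Returns the mean fraction of attention spent on prompt tokens."""
    return weights[:, :num_prompts].sum(dim=-1).mean().item()
```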
Validation for the importance of the bias
The proposed bias-tuning method validates the significance of dataset bias in model adaptation.
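The summary does not spell out the tuning recipe, so the sketch below assumes a bias-only variant in the spirit of BitFit: freeze all weights and train only the bias vectors, which can add no more than a fixed per-layer shift, i.e., a pure dataset-level bias.

```python
import torch.nn as nn

def enable_bias_tuning(model: nn.Module) -> None:
    """Freeze all weights; leave only bias parameters trainable (assumed variant)."""
    for name, param in model.named_parameters():
        param.requires_grad = name.endswith("bias")
```

If tuning only these bias terms recovers most of the gain of full prompt learning on a target dataset, that supports the claim that prompts mainly inject dataset bias.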
Statistics
Prompts function as a dataset bias, improving recognition performance.