Core Concept
Prompt tuning can lead to misalignment in vision-language models, but feature shift consistency can help maintain alignment and improve generalization.
Summary
This paper shows that prompt tuning can cause misalignment in vision-language models. In particular, tuning prompts along only a single branch improves performance on specific tasks but can degrade the model's generalization ability. Feature shift consistency is shown to help maintain alignment and improve generalization.
Key Points
Prompt tuning methods fine-tune the model by introducing learnable prompts.
The feature shift is used to estimate the variation of features generated by the vision-language model caused by prompt tuning.
The feature shift loss aims to minimize the discrepancy between feature shifts from different modalities.
The "surgery" block dynamically penalizes cross-modal misalignment based on the measured scale of feature shift.
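The feature-shift idea above can be sketched in plain Python. Here the feature shift is taken as the element-wise difference between features produced with and without learnable prompts, and the consistency loss penalizes the discrepancy between the text-side and image-side shifts. This is a minimal illustrative sketch under those assumptions; the function names and the use of raw Python lists are not the paper's implementation.

```python
def feature_shift(prompted, frozen):
    # Feature shift: the change in a feature vector caused by prompt tuning,
    # i.e. the prompted feature minus the frozen (pre-tuning) feature.
    return [p - f for p, f in zip(prompted, frozen)]

def feature_shift_loss(text_shift, image_shift):
    # Consistency loss: mean squared discrepancy between the feature shifts
    # of the two modalities; minimizing it keeps the branches aligned.
    n = len(text_shift)
    return sum((t - i) ** 2 for t, i in zip(text_shift, image_shift)) / n

# Toy example with 3-dimensional features.
text_shift = feature_shift([1.0, 2.0, 3.0], [0.5, 1.5, 2.5])    # [0.5, 0.5, 0.5]
image_shift = feature_shift([1.0, 1.0, 1.0], [0.0, 0.0, 0.0])   # [1.0, 1.0, 1.0]
loss = feature_shift_loss(text_shift, image_shift)              # 0.25
```

In practice the shifts would be tensors over a batch, and the loss would be one regularization term added to the task objective so that prompt tuning on one branch cannot drift arbitrarily far from the other.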
Quotes
"Prompt learning is effective for fine-tuning foundation models to improve their generalization across a variety of downstream tasks."
"In this paper, we first demonstrate that prompt tuning along only one single branch of CLIP (e.g., language or vision) is the reason why the misalignment occurs."
"Our main contribution can be summarized as follows: We systematically and quantitatively explain the reason, namely feature shift, behind the degraded generalizability of VLMs during prompt tuning."