Core Concepts
This research paper introduces a novel prompt tuning method for Vision-Language Models (VLMs) that enhances their ability to generalize to unseen classes while maintaining strong performance on seen classes.
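Below is a minimal, illustrative sketch of the kind of soft prompt tuning discussed here: learnable context vectors are prepended to frozen class-name embeddings, the backbone encoders stay frozen, and a regularization term keeps the learned text features close to those of a fixed hand-crafted prompt so general knowledge about unseen classes is less likely to be lost. All module names, tensor sizes, and the regularization weight are hypothetical stand-ins, not the paper's actual implementation.

```python
# Hedged sketch of soft prompt tuning for a CLIP-like model.
# The encoders below are toy placeholders; in practice they would be a
# pretrained, frozen vision-language model such as CLIP.
import torch
import torch.nn as nn
import torch.nn.functional as F

EMBED_DIM, FEAT_DIM, N_CTX, N_CLASSES = 32, 64, 4, 10  # assumed toy sizes


class FrozenTextEncoder(nn.Module):
    """Placeholder for a pretrained, frozen text encoder."""

    def __init__(self):
        super().__init__()
        self.proj = nn.Linear(EMBED_DIM, FEAT_DIM)

    def forward(self, token_embeddings):           # (B, L, EMBED_DIM)
        pooled = token_embeddings.mean(dim=1)      # crude pooling stand-in
        return F.normalize(self.proj(pooled), dim=-1)


class SoftPromptLearner(nn.Module):
    """Learnable context vectors prepended to frozen class-name embeddings."""

    def __init__(self, class_name_embeddings):     # (C, L_name, EMBED_DIM)
        super().__init__()
        self.ctx = nn.Parameter(torch.randn(N_CTX, EMBED_DIM) * 0.02)
        self.register_buffer("name_emb", class_name_embeddings)

    def forward(self):
        ctx = self.ctx.unsqueeze(0).expand(self.name_emb.size(0), -1, -1)
        return torch.cat([ctx, self.name_emb], dim=1)  # (C, N_CTX + L_name, D)


text_encoder = FrozenTextEncoder().requires_grad_(False)
name_emb = torch.randn(N_CLASSES, 3, EMBED_DIM)        # frozen class-name tokens
prompt_learner = SoftPromptLearner(name_emb)           # only the context is trained
optimizer = torch.optim.SGD(prompt_learner.parameters(), lr=2e-3)

# Text features from a fixed hand-crafted prompt (e.g. "a photo of a <class>"),
# used here only as a general-knowledge anchor for regularization.
with torch.no_grad():
    handcrafted_features = text_encoder(torch.randn(N_CLASSES, 7, EMBED_DIM))

image_features = F.normalize(torch.randn(16, FEAT_DIM), dim=-1)  # stand-in batch
labels = torch.randint(0, N_CLASSES, (16,))
reg_weight = 8.0  # assumed hyperparameter

for _ in range(10):
    text_features = text_encoder(prompt_learner())              # (C, FEAT_DIM)
    logits = 100.0 * image_features @ text_features.t()         # scaled cosine sims
    ce_loss = F.cross_entropy(logits, labels)
    # Pull learned text features toward the hand-crafted ones so that
    # zero-shot knowledge about unseen classes is not entirely forgotten.
    kg_loss = F.mse_loss(text_features, handcrafted_features)
    loss = ce_loss + reg_weight * kg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```

The design choice this illustrates is the trade-off quoted below: plain soft prompts fit the seen classes well but can overfit to them, while anchoring to hand-crafted prompt features trades a little base-class accuracy for better behavior on new classes.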
Statistics
In the 16-shot setting, KgCoOp's accuracy on base classes is 1.73% lower than CoOp's.
In the 16-shot setting, ProGrad's accuracy on new classes is 3.12% lower than KgCoOp's.
CoOp, CoCoOp, ProGrad, and KgCoOp surpass zero-shot CLIP in accuracy on base classes by 6.84%, 6.68%, 8.86%, and 7.29%, respectively.
Relative to zero-shot CLIP, CoOp, CoCoOp, ProGrad, and KgCoOp exhibit drops in accuracy on new classes of 11.40%, 8.93%, 5.30%, and 1.02%, respectively.
Quotations
"While hand-crafted or template-based prompts can be applied more broadly to unseen classes, they often result in poor performance in downstream tasks (i.e., seen classes). Conversely, soft prompts tend to perform well in downstream tasks, but their lack of the generalizability stems from overfitting to the prompting seen classes."
"One may ask: 'What is the affecting of class-wise augmentation for CoOp-based methods and other approaches that employ hand-crafted general knowledge?'"