The paper proposes LP++, a generalized linear probe for efficient few-shot adaptation of vision-language models like CLIP. The key insights are:
The standard linear probe (LP) baseline, which only uses the visual features, has been underestimated in the literature. By incorporating learnable blending of visual and text features, LP++ achieves highly competitive few-shot performance.
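The blending idea can be sketched in a few lines. This is a minimal illustration, not the paper's exact parameterization: the per-class blending weights `alpha` and the choice to add alpha-weighted text similarities to a learned visual classifier are assumptions made for clarity.

```python
import numpy as np

rng = np.random.default_rng(0)
D, K, N = 512, 10, 32          # feature dim, classes, batch size

# Frozen, L2-normalized embeddings (placeholders for CLIP encoder outputs).
f_visual = rng.normal(size=(N, D))
f_visual /= np.linalg.norm(f_visual, axis=1, keepdims=True)
t_text = rng.normal(size=(K, D))
t_text /= np.linalg.norm(t_text, axis=1, keepdims=True)

# Learnable parameters: visual class prototypes W and per-class
# blending weights alpha (hypothetical shape chosen for illustration).
W = t_text.copy()              # one plausible init: start from text prototypes
alpha = np.ones(K)

# Blended logits: learned visual-prototype similarity plus
# alpha-weighted similarity to the frozen text prototypes.
logits = f_visual @ W.T + alpha * (f_visual @ t_text.T)
print(logits.shape)            # → (32, 10)
```

Setting `alpha` to zero recovers the standard visual-only linear probe, which makes the blend a strict generalization of the LP baseline.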
LP++ uses a block coordinate Majorize-Minimize (BMM) optimization procedure with data-driven, task-specific step sizes derived from Lipschitz bounds on the objective's gradient. This removes the need for intensive hyperparameter tuning on validation sets, making LP++ computationally efficient.
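The step-size idea can be illustrated on a plain softmax linear probe. This is a generic gradient-descent sketch using the standard Lipschitz bound for softmax cross-entropy as a stand-in for the paper's task-specific majorants; it is not the paper's exact BMM update.

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 64, 16, 5
X = rng.normal(size=(N, D))            # placeholder few-shot features
y = rng.integers(0, K, size=N)

def softmax(z):
    z = z - z.max(axis=1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def ce_loss(W):
    p = softmax(X @ W.T)
    return -np.log(p[np.arange(N), y] + 1e-12).mean()

# Data-driven step size: for softmax cross-entropy the gradient in W is
# Lipschitz with constant at most ||X||_2^2 / (2N), so 1/L is a valid
# step size computed from the data alone -- no tuned learning rate.
L = np.linalg.norm(X, 2) ** 2 / (2 * N)
step = 1.0 / L

W = np.zeros((K, D))
onehot = np.eye(K)[y]
for _ in range(100):
    p = softmax(X @ W.T)
    grad = (p - onehot).T @ X / N
    W -= step * grad                   # guaranteed-descent step

print(ce_loss(W) < np.log(K))          # loss drops below the uniform baseline
```

Because the step size comes from a provable smoothness bound, every iteration decreases the loss, which is the property that lets LP++ skip validation-set tuning.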
The paper also provides data-informed initializations for the visual prototypes and blending parameters, further improving optimization and performance.
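One plausible form of a data-informed prototype initialization is the class mean of the few-shot support features; the snippet below is a sketch under that assumption, and the paper's exact scheme may differ.

```python
import numpy as np

rng = np.random.default_rng(2)
D, K, shots = 8, 3, 4
feats = rng.normal(size=(K * shots, D))        # placeholder support features
labels = np.repeat(np.arange(K), shots)

# Data-informed init: each visual prototype starts at the mean of that
# class's few-shot features instead of a random draw.
W0 = np.stack([feats[labels == k].mean(axis=0) for k in range(K)])
print(W0.shape)                                # → (3, 8)
```

Starting from class means places each prototype near its class cluster, so the subsequent optimization needs fewer iterations than from a random start.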
Compared to recent prompt learning and adapter-based methods, LP++ achieves state-of-the-art few-shot performance on 11 benchmark datasets, while being orders of magnitude faster and operating in a black-box setting without accessing the internal representations of the pre-trained models.
Key insights from the paper by Yunshi Huang... at arxiv.org, 04-04-2024: https://arxiv.org/pdf/2404.02285.pdf