Zhu, Y., Ji, Y., Zhao, Z., Wu, G., & Wang, L. (2024). AWT: Transferring Vision-Language Models via Augmentation, Weighting, and Transportation. In Advances in Neural Information Processing Systems.
This paper introduces AWT, a novel framework for adapting pre-trained vision-language models (VLMs) to downstream tasks, particularly in zero-shot and few-shot learning scenarios. The authors target two weaknesses of off-the-shelf VLMs: they struggle to focus on task-specific details and to transfer knowledge effectively to new concepts.
AWT employs a three-pronged approach (a minimal sketch follows the list):
1. Augmentation: enrich the inputs with diverse visual perspectives via image transformations and with class descriptions generated by a language model.
2. Weighting: dynamically weight the augmented inputs according to prediction confidence, so that uninformative views and descriptions contribute less.
3. Transportation: use optimal transport to mine semantic correlations between the weighted image and text sets in the vision-language embedding space.
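To make the weighting and transportation steps concrete, here is a minimal, self-contained Python sketch scoring a single candidate class. It is an illustration under simplifying assumptions, not the authors' implementation: random unit vectors stand in for CLIP features of the augmented views and descriptions, the weights are derived from cross-modal similarity rather than the paper's exact prediction-confidence scheme, and the Sinkhorn solver and temperature values are generic choices.

```python
import torch
import torch.nn.functional as F

def sinkhorn_plan(cost, a, b, eps=0.1, n_iters=100):
    """Entropic optimal transport via Sinkhorn iterations.
    cost: (M, N) cost matrix; a: (M,), b: (N,) marginals summing to 1."""
    K = torch.exp(-cost / eps)                 # Gibbs kernel
    u = torch.ones_like(a)
    for _ in range(n_iters):
        v = b / (K.t() @ u)                    # column scaling
        u = a / (K @ v)                        # row scaling
    return u.unsqueeze(1) * K * v.unsqueeze(0)  # plan = diag(u) K diag(v)

# Stand-in features (hypothetical): in AWT these would be CLIP embeddings
# of M augmented image views and N LLM-generated class descriptions;
# random unit vectors are used here so the sketch runs on its own.
torch.manual_seed(0)
M, N, D = 8, 10, 512
img_views = F.normalize(torch.randn(M, D), dim=-1)   # augmented image views
cls_descs = F.normalize(torch.randn(N, D), dim=-1)   # class descriptions

# Simplified confidence weighting: weight each view/description by the
# softmax of its best cross-modal similarity, so weak augmentations get
# small mass in the transport marginals (temperature 0.05 is arbitrary).
sim = img_views @ cls_descs.t()                      # (M, N) cosine similarity
a = F.softmax(sim.max(dim=1).values / 0.05, dim=0)   # view weights
b = F.softmax(sim.max(dim=0).values / 0.05, dim=0)   # description weights

# OT distance between the two weighted feature sets; a smaller distance
# means the image matches this candidate class better.
cost = 1.0 - sim                                     # cosine distance
plan = sinkhorn_plan(cost, a, b)
ot_dist = (plan * cost).sum()
print(f"OT distance to this class: {ot_dist.item():.4f}")
```

In practice the same computation would be repeated for every candidate class, with the class achieving the smallest transport distance taken as the prediction.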
The researchers implemented AWT using the CLIP model and evaluated its performance on 21 datasets across four challenging tasks: zero-shot and few-shot image classification, out-of-distribution generalization, and zero-shot video action recognition.
The study underscores the effectiveness of AWT in enhancing the transferability and performance of pre-trained VLMs. By augmenting inputs, dynamically weighting their importance, and leveraging optimal transport, AWT enables VLMs to better focus on task-specific details and transfer knowledge to new concepts, leading to substantial improvements in various vision-language tasks.
This research significantly contributes to the field of vision-language understanding by introducing a novel and effective adaptation framework for VLMs. AWT's ability to enhance zero-shot and few-shot learning capabilities has the potential to broaden the applicability of VLMs in real-world scenarios where labeled data is scarce.
While AWT shows promising results, the authors acknowledge limitations in handling low-resolution images and suggest exploring alternative augmentation techniques like diffusion models. Future research directions include investigating AWT's applicability to other VLM architectures and exploring its potential in more complex vision-language tasks.