Core Concepts
Transferable Visual Prompting (TVP) can effectively improve the performance of diverse Multimodal Large Language Models (MLLMs) on a wide range of tasks by optimizing a set of shared visual prompts.
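Visual prompting typically works by overlaying a small set of learnable pixels (often a border) onto every input image; the same prompt tensor is then shared across models. A minimal sketch of such a border-style prompt (the padding parameterization is common visual-prompting practice and assumed here; TVP's exact design may differ):

```python
import numpy as np

def apply_border_prompt(image, prompt, pad=30):
    """Overlay a learnable border prompt onto an image.

    image:  (H, W, 3) array in [0, 1]
    prompt: (H, W, 3) array of learnable values; only the outer
            `pad`-pixel border is applied, the interior is masked out.
    """
    h, w, _ = image.shape
    mask = np.zeros((h, w, 1), dtype=image.dtype)
    mask[:pad, :], mask[-pad:, :] = 1.0, 1.0   # top and bottom bands
    mask[:, :pad], mask[:, -pad:] = 1.0, 1.0   # left and right bands
    return np.clip(image + mask * prompt, 0.0, 1.0)
```

Because the prompt lives entirely in pixel space, the same optimized array can be pasted onto inputs of any MLLM without touching model weights.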
Abstract
The paper explores a novel setting where the goal is to improve the performance of diverse MLLMs on a specific downstream task by optimizing a set of shared parameters, rather than fine-tuning each model independently.
The key insights are:
- Existing visual prompting methods can effectively enhance the performance of the model used for prompt training, but the learned prompts often fail to transfer well to other MLLMs due to "cross-model feature corruption".
- To address this issue, the authors propose Transferable Visual Prompting (TVP), which integrates two novel strategies:
  - Feature Consistency Alignment (FCA): imposes constraints on the prompted features to maintain task-agnostic knowledge and prevent excessive feature changes.
  - Task Semantics Enrichment (TSE): leverages CLIP to explicitly embed task-specific semantics into the visual prompts.
- Extensive experiments on 10 diverse datasets covering visual recognition, counting, reasoning, and hallucination tasks demonstrate that TVP can effectively boost the performance of 6 modern MLLMs, significantly outperforming existing visual prompting baselines.
- TVP exhibits good generalization across datasets and robustness to image corruptions, highlighting its practicality in real-world scenarios.
- Compared to fine-tuning methods, TVP provides a more resource-friendly and flexible solution for adapting multiple models to a given task simultaneously.
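The two strategies above can be read as regularizers added to the task loss on the training MLLM. A hedged sketch of how the combined objective might be assembled (an L2 distance for FCA and a cosine term for TSE are illustrative choices, as are the `alpha`/`beta` weights; the paper's exact formulations may differ):

```python
import numpy as np

def fca_loss(feat_prompted, feat_clean):
    """Feature Consistency Alignment: penalize drift of the prompted
    image's features from the clean features, preserving task-agnostic
    knowledge (L2 penalty here for illustration)."""
    return float(np.mean((feat_prompted - feat_clean) ** 2))

def tse_loss(img_embed, text_embed):
    """Task Semantics Enrichment: pull the CLIP embedding of the
    prompted image toward the task's text embedding (1 - cosine sim)."""
    cos = img_embed @ text_embed / (
        np.linalg.norm(img_embed) * np.linalg.norm(text_embed))
    return float(1.0 - cos)

def tvp_loss(task_loss, feat_prompted, feat_clean, img_embed, text_embed,
             alpha=1.0, beta=1.0):
    """Total objective: task loss on the training MLLM plus the two
    transferability regularizers (weights alpha/beta are hypothetical)."""
    return (task_loss
            + alpha * fca_loss(feat_prompted, feat_clean)
            + beta * tse_loss(img_embed, text_embed))
```

Intuitively, FCA keeps the prompt from corrupting features that other MLLMs rely on, while TSE injects the task semantics directly into the prompt via CLIP, so transfer no longer depends solely on the training model's gradients.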
Stats
The visual prompts trained on MiniGPT-4 can improve the top-1 accuracy of InstructBLIP on CIFAR-10 by 36.4%.
The visual prompts trained on InstructBLIP can improve the top-1 accuracy of BLIVA on ImageNette by 25.4%.
The visual prompts trained on the ensemble of MiniGPT-4 and InstructBLIP can improve the top-1 accuracy of BLIP2 on SVHN by 28.8%.
The visual prompts trained on the ensemble of MiniGPT-4 and InstructBLIP can improve the AUC score of VPGTrans on Hateful Memes by 26.9%.
Quotes
"The visual prompts trained with one single model can facilitate the overall performance of 6 modern MLLMs on 10 datasets ranging from visual tasks like recognition and counting to multimodal reasoning and hallucination correction, which significantly surpasses the existing visual prompting baselines."
"TVP can enhance different models with diverse data scales, generalize to different datasets, and resist image corruptions, emphasizing the practicality of our method in real scenarios."