Core Concepts
Visual tuning techniques enable efficient reuse of knowledge from large pre-trained foundation models for downstream visual tasks, reducing the need to retrain entire models.
Abstract
This paper provides a comprehensive overview of recent advancements in visual tuning techniques. It categorizes these techniques into five main groups (brief code sketches illustrating each group follow the list):
Fine-tuning: Updating all parameters of the pre-trained model or just the task-specific head. This is the standard transfer learning approach but becomes less practical as models scale up.
Prompt Tuning: Reformulating downstream tasks as the original pre-training task by designing prompts that are vision-driven, language-driven, or a combination of both. This allows the pre-trained model to be adapted efficiently with minimal tunable parameters.
Adapter Tuning: Inserting additional trainable parameters into the pre-trained model, which is kept frozen. This includes sequential adapters, parallel adapters, and mixed adapters. Adapters provide a lightweight alternative to full fine-tuning.
Parameter Tuning: Directly modifying a subset of the model's parameters, such as the bias terms, the weights, or both, to adapt the pre-trained model. This includes techniques like LoRA and Compacter.
Remapping Tuning: Transferring the learned knowledge of a pre-existing model to a new downstream model through knowledge distillation, weight remapping, or architecture remapping.
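For the fine-tuning group, the sketch below shows the head-only variant in PyTorch: the pre-trained backbone is frozen and only a new task-specific classifier is trained. The torchvision ResNet-50 backbone and the 10-class head are illustrative assumptions, not the paper's experimental setup.

```python
# Minimal sketch: head-only fine-tuning of a frozen pre-trained backbone.
import torch
import torch.nn as nn
from torchvision import models

model = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)

# Freeze every pre-trained parameter ...
for p in model.parameters():
    p.requires_grad = False

# ... then replace and train only the task-specific head.
num_classes = 10  # hypothetical downstream task
model.fc = nn.Linear(model.fc.in_features, num_classes)  # new head is trainable by default

optimizer = torch.optim.AdamW(
    [p for p in model.parameters() if p.requires_grad], lr=1e-3
)
```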
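For prompt tuning, the next sketch illustrates the vision-driven idea in the spirit of VPT: a small set of learnable prompt tokens is prepended to the patch embeddings of a frozen Transformer encoder, and only the prompts and the head are updated. The backbone module, embedding dimension, and prompt count are placeholder assumptions, not a specific library API.

```python
# Minimal sketch of vision-driven prompt tuning: learnable prompt tokens are
# prepended to the patch embeddings of a frozen Transformer encoder.
import torch
import torch.nn as nn

class VisualPromptModel(nn.Module):
    def __init__(self, frozen_backbone, embed_dim=768, num_prompts=10, num_classes=10):
        super().__init__()
        self.backbone = frozen_backbone           # pre-trained encoder blocks (assumed placeholder)
        for p in self.backbone.parameters():
            p.requires_grad = False
        # Only the prompt tokens and the head are trained.
        self.prompts = nn.Parameter(torch.zeros(1, num_prompts, embed_dim))
        nn.init.trunc_normal_(self.prompts, std=0.02)
        self.head = nn.Linear(embed_dim, num_classes)

    def forward(self, patch_tokens):              # (B, N, D) patch embeddings
        B = patch_tokens.shape[0]
        prompts = self.prompts.expand(B, -1, -1)  # broadcast prompts to the batch
        tokens = torch.cat([prompts, patch_tokens], dim=1)
        feats = self.backbone(tokens)             # frozen encoder processes prompts + patches
        return self.head(feats.mean(dim=1))       # pool and classify
```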
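For adapter tuning, this sketch shows the sequential bottleneck variant: a down-projection, nonlinearity, and up-projection with a residual connection, inserted after a frozen sublayer of the pre-trained model. Module names and dimensions are illustrative assumptions.

```python
# Minimal sketch of a sequential bottleneck adapter wrapped around a frozen sublayer.
import torch
import torch.nn as nn

class Adapter(nn.Module):
    def __init__(self, dim=768, bottleneck=64):
        super().__init__()
        self.down = nn.Linear(dim, bottleneck)   # project into a small bottleneck
        self.act = nn.GELU()
        self.up = nn.Linear(bottleneck, dim)     # project back to the model width
        nn.init.zeros_(self.up.weight)           # start as a near-identity module
        nn.init.zeros_(self.up.bias)

    def forward(self, x):
        return x + self.up(self.act(self.down(x)))  # residual keeps the frozen path intact

class BlockWithAdapter(nn.Module):
    """Wraps one frozen pre-trained sublayer with a trainable adapter (sequential variant)."""
    def __init__(self, frozen_sublayer, dim=768):
        super().__init__()
        self.sublayer = frozen_sublayer
        for p in self.sublayer.parameters():
            p.requires_grad = False              # backbone stays frozen
        self.adapter = Adapter(dim)              # only the adapter is trained

    def forward(self, x):
        return self.adapter(self.sublayer(x))
```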
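For parameter tuning, the sketch below follows the LoRA idea: the frozen weight matrix is left untouched and a low-rank update is learned alongside it. The rank and scaling values are illustrative assumptions.

```python
# Minimal sketch of LoRA-style parameter tuning: the frozen weight W is kept
# fixed and a low-rank update B @ A (rank r) is learned on top of it.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, frozen_linear: nn.Linear, r=8, alpha=16):
        super().__init__()
        self.base = frozen_linear                            # pre-trained projection, frozen
        for p in self.base.parameters():
            p.requires_grad = False
        in_f, out_f = self.base.in_features, self.base.out_features
        self.A = nn.Parameter(torch.randn(r, in_f) * 0.01)   # low-rank factor
        self.B = nn.Parameter(torch.zeros(out_f, r))         # zero-init so training starts at W
        self.scale = alpha / r

    def forward(self, x):
        # y = x W^T (+ bias) + scale * x A^T B^T
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())
```

Because the learned update is a plain low-rank matrix, it can be merged into the frozen weight after training, so inference incurs no extra cost.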
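For remapping tuning, the last sketch shows the knowledge-distillation route: a compact student model is trained to match the temperature-softened logits of the frozen pre-trained teacher while also fitting the downstream labels. The temperature and weighting values are illustrative assumptions.

```python
# Minimal sketch of remapping via knowledge distillation: the student is trained
# against the teacher's softened logits plus the ordinary supervised loss.
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=4.0, alpha=0.5):
    # Soft-label term: KL divergence between temperature-scaled distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard-label term: cross-entropy on the downstream labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```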
The paper discusses the advantages, disadvantages, and technical details of each category of visual tuning techniques. It also highlights promising future research directions, such as improving the interpretability and controllability of prompts, and further reducing the computational and memory overhead of parameter-efficient tuning methods.
Stats
"Visual tuning techniques can achieve state-of-the-art performance on many vision benchmark datasets with a fraction of the tunable parameters compared to full fine-tuning."
"Recent large vision models have reached over 22 billion parameters, making full fine-tuning increasingly impractical."
Quotes
"Visual tuning techniques enable efficient reuse of knowledge from large pre-trained foundation models for downstream visual tasks, reducing the need to retrain entire models."
"Prompt tuning unifies all downstream tasks into pre-trained tasks via designing specific templates to fully exploit the capabilities of foundation models."
"Adapter tuning provides a lightweight alternative to extensive model fine-tuning by inserting additional trainable parameters into a pre-trained model that has been frozen."