Core Concepts
MoCLE is a novel architecture for vision-language instruction tuning that resolves task conflicts by routing inputs to cluster-specific experts, achieving specialization and generalization simultaneously.
Abstract
The paper introduces MoCLE, a Mixture of Cluster-conditional LoRA Experts, for vision-language instruction tuning. It addresses task conflicts in Large Vision-Language Models (LVLMs) by activating task-customized model parameters conditioned on the cluster an instruction belongs to. Key insights include:
- Introduction of MoCLE for LVLM instruction tuning.
- Explanation of the clustering process for instructions.
- Description of the MoCLE architecture with task experts and a universal expert.
- Evaluation results showing performance gains on various tasks.
- Ablation studies on components like LoRA rank and number of clusters.
- Visualizations demonstrating clustering assignments and routing decisions.
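The clustering and routing ideas above can be sketched in a few lines. The snippet below is a minimal illustration, not the paper's implementation: it clusters instruction embeddings with a simple k-means, then applies a frozen linear layer whose output is adjusted by a cluster-selected LoRA expert plus an always-active universal expert. The class name `MoCLELayer`, the blending weight `alpha`, and the k-means routine are all hypothetical simplifications for exposition.

```python
import numpy as np

rng = np.random.default_rng(0)

def kmeans(X, k, iters=20):
    """Toy k-means over instruction embeddings (stand-in for the
    paper's instruction-clustering step)."""
    centers = X[rng.choice(len(X), k, replace=False)]
    for _ in range(iters):
        dists = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = dists.argmin(1)
        for j in range(k):
            if (labels == j).any():
                centers[j] = X[labels == j].mean(0)
    return centers, labels

class MoCLELayer:
    """Frozen linear layer with k cluster-routed LoRA experts plus
    one universal expert (hypothetical simplification of MoCLE)."""

    def __init__(self, d_in, d_out, k, rank=4):
        # Frozen base weight of the pretrained model
        self.W = rng.normal(size=(d_out, d_in)) / np.sqrt(d_in)
        # k task experts + 1 universal expert, each a low-rank (A, B) pair;
        # B is zero-initialized so experts start as a no-op (standard LoRA init)
        self.A = rng.normal(size=(k + 1, rank, d_in)) * 0.01
        self.B = np.zeros((k + 1, d_out, rank))

    def forward(self, x, cluster_id, alpha=0.5):
        base = self.W @ x
        # Expert chosen by the instruction's cluster assignment
        task = self.B[cluster_id] @ (self.A[cluster_id] @ x)
        # Universal expert (last slot) is active for every input
        universal = self.B[-1] @ (self.A[-1] @ x)
        # alpha blends specialization (task expert) with generalization
        return base + alpha * task + (1 - alpha) * universal

# Usage: cluster 30 fake instruction embeddings, then route one through the layer
X = rng.normal(size=(30, 8))
_, labels = kmeans(X, k=3)
layer = MoCLELayer(d_in=8, d_out=6, k=3)
y = layer.forward(X[0], cluster_id=int(labels[0]))
```

Because the `B` matrices start at zero, the layer initially reproduces the frozen base output; training would then specialize each task expert on its cluster while the universal expert absorbs shared knowledge for held-out tasks.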
Stats
Extensive experiments on 11 zero-shot tasks demonstrate the effectiveness of MoCLE.
InstructBLIP is used as the base LVLM for evaluation.
Quotes
"We propose Mixture of Cluster-conditional LoRA Experts (MoCLE) for vision-language instruction tuning."
"MoCLE achieves significant improvement on held-out tasks including image captioning and visual question answering."