Core Concepts
A conceptually simple yet effective multilingual CLIP compression framework that trains a lightweight multilingual vision-language model, DC-CLIP, for both Chinese and English contexts through progressive knowledge distillation and alignment.
Summary
The paper introduces DC-CLIP, a lightweight variant of the CLIP model designed to substantially reduce computational and storage demands without sacrificing performance, while supporting both Chinese and English contexts.
The framework consists of two main stages:
- Vision-Language Feature Distillation:
  - The image and text encoders of a pre-trained AltCLIP model serve as teacher models.
  - Lightweight student models learn robust visual and multilingual textual feature representations from the corresponding teachers through feature distillation (a minimal sketch of this stage follows the list).
- Vision-Language Feature Alignment:
  - The distilled image and text features are further aligned through contrastive learning on a small-scale Chinese and English image-text dataset (see the alignment sketch below).
  - This stage improves the model's precision in image-text matching and its comprehension in multilingual contexts.
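The first stage can be pictured as standard feature-level knowledge distillation. Below is a minimal sketch, assuming a mean-squared-error objective between L2-normalized teacher and student features and a hypothetical `FeatureDistiller` projection module; the paper's exact student architectures, projection heads, and loss weighting are not specified here.

```python
# Sketch of stage 1 (vision-language feature distillation).
# Assumptions: frozen AltCLIP teacher encoders, a linear projection from the
# student's feature space to the teacher's, and an MSE loss on normalized
# features. These are illustrative choices, not the paper's exact recipe.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FeatureDistiller(nn.Module):
    def __init__(self, student_dim: int, teacher_dim: int):
        super().__init__()
        # Project student features into the teacher's embedding space.
        self.proj = nn.Linear(student_dim, teacher_dim)

    def forward(self, student_feats: torch.Tensor, teacher_feats: torch.Tensor) -> torch.Tensor:
        # L2-normalize both sides and match them with an MSE loss.
        s = F.normalize(self.proj(student_feats), dim=-1)
        t = F.normalize(teacher_feats, dim=-1)
        return F.mse_loss(s, t)

def distillation_step(student_img, student_txt, teacher_img, teacher_txt,
                      images, texts, img_distiller, txt_distiller):
    """One distillation step: the frozen teachers provide target features;
    the lightweight students learn to reproduce them."""
    with torch.no_grad():            # teachers are frozen
        t_img = teacher_img(images)
        t_txt = teacher_txt(texts)
    s_img = student_img(images)
    s_txt = student_txt(texts)
    # Distill the vision and text branches independently.
    return img_distiller(s_img, t_img) + txt_distiller(s_txt, t_txt)
```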
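The second stage aligns the distilled image and text features with contrastive learning. The sketch below assumes a standard CLIP-style symmetric InfoNCE loss over a batch of matched image-text pairs with a fixed temperature; any learnable temperature or additional tricks DC-CLIP may use are not reproduced here.

```python
# Sketch of stage 2 (vision-language feature alignment) as a symmetric
# image-text contrastive loss. The temperature value is an assumption.
import torch
import torch.nn.functional as F

def contrastive_alignment_loss(image_feats: torch.Tensor,
                               text_feats: torch.Tensor,
                               temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss for a batch of matched image-text pairs."""
    img = F.normalize(image_feats, dim=-1)
    txt = F.normalize(text_feats, dim=-1)
    logits = img @ txt.t() / temperature             # cosine-similarity logits
    targets = torch.arange(logits.size(0), device=logits.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
    return 0.5 * (loss_i2t + loss_t2i)
```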
Comprehensive experiments on the ELEVATER benchmark show that, compared with existing models of similar parameter scale, DC-CLIP achieves superior performance in the English context and competitive performance in the Chinese context despite using less training data, demonstrating the effectiveness of the proposed training mechanism.
Stats
DC-CLIP achieves state-of-the-art performance on the ImageNet dataset in the English context.
DC-CLIP outperforms baseline models on 6 out of 8 English datasets and is competitive on the remaining 2 datasets.
In the Chinese context, DC-CLIP outperforms the baseline models on 5 out of 8 datasets.
Quotes
"DC-CLIP maintains robust performance in Chinese and English settings while adapting to the resource limitations of mobile devices such as smartphones."
"This breakthrough not only epitomizes technological innovation in multilingual vision-language models but also carves new avenues for deploying efficient and pragmatic multimodal model applications on a plethora of mobile and edge devices in the future."