Efficient Multilingual Vision-Language Model Compression through Progressive Knowledge Distillation and Alignment
We present a conceptually simple yet effective multilingual CLIP compression framework that trains a lightweight multilingual vision-language model, DC-CLIP, for both Chinese and English contexts through progressive knowledge distillation and alignment.
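To make the core idea concrete, below is a minimal sketch (not the authors' code) of feature-level knowledge distillation for compressing a CLIP-style model: a lightweight student encoder learns to reproduce a frozen teacher's embeddings. All module names, dimensions, and the specific loss are illustrative assumptions; the paper's progressive scheme would stage such objectives across training phases.

```python
# Hypothetical sketch of feature-level distillation for CLIP compression.
# Names and dimensions are assumptions, not the DC-CLIP implementation.
import torch
import torch.nn as nn
import torch.nn.functional as F


class DistillStudent(nn.Module):
    """Toy student encoder projecting inputs into the teacher's embedding space."""

    def __init__(self, in_dim: int = 768, embed_dim: int = 512):
        super().__init__()
        self.proj = nn.Sequential(
            nn.Linear(in_dim, 256), nn.GELU(), nn.Linear(256, embed_dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # L2-normalize so student and teacher features are comparable.
        return F.normalize(self.proj(x), dim=-1)


def distillation_loss(student_emb: torch.Tensor, teacher_emb: torch.Tensor) -> torch.Tensor:
    # Align student features with the frozen teacher's via cosine similarity;
    # a progressive scheme could swap or weight such losses stage by stage.
    return 1.0 - F.cosine_similarity(student_emb, teacher_emb, dim=-1).mean()


if __name__ == "__main__":
    student = DistillStudent()
    x = torch.randn(8, 768)  # stand-in input features
    with torch.no_grad():
        # Frozen teacher output, simulated here with random normalized vectors.
        teacher_emb = F.normalize(torch.randn(8, 512), dim=-1)
    loss = distillation_loss(student(x), teacher_emb)
    loss.backward()
    print(f"distillation loss: {loss.item():.4f}")
```

In a real pipeline the teacher embeddings would come from a pretrained multilingual CLIP, and the student would be trained on paired image-text data in both Chinese and English.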