Efficient and Accurate Image-Text Models: MobileCLIP Leverages Multi-Modal Reinforced Training
The authors introduce MobileCLIP, a new family of efficient image-text models optimized for runtime performance. They also propose multi-modal reinforced training, a novel approach that improves the accuracy of efficient models by transferring knowledge from an image captioning model and an ensemble of strong CLIP encoders.
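
To make the knowledge-transfer idea more concrete, here is a minimal sketch of how an ensemble of teacher encoders could supervise a smaller student. It is an illustration under stated assumptions, not the authors' implementation: the loss form (KL divergence between teacher and student image-text similarity distributions, mixed with a standard contrastive loss), the function names, the weight `lam`, and the temperature are all hypothetical choices. The captioning-model part of the approach is not shown; in a setup like this, synthetic captions would simply be additional text inputs.

```python
import numpy as np


def l2_normalize(x):
    # Scale each embedding (row) to unit length.
    return x / np.linalg.norm(x, axis=-1, keepdims=True)


def log_softmax(logits):
    # Numerically stable log-softmax over the last axis.
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def similarity_logits(img, txt, temperature):
    # Cosine similarity of every image with every text, scaled by temperature.
    return l2_normalize(img) @ l2_normalize(txt).T / temperature


def contrastive_loss(img, txt, temperature=0.07):
    # Standard CLIP-style loss: the i-th image should match the i-th text.
    logits = similarity_logits(img, txt, temperature)
    diag = np.arange(logits.shape[0])
    image_to_text = -log_softmax(logits)[diag, diag].mean()
    text_to_image = -log_softmax(logits.T)[diag, diag].mean()
    return 0.5 * (image_to_text + text_to_image)


def distillation_loss(student_img, student_txt, teacher_pairs, temperature=0.07):
    # Assumed form: KL(teacher || student) over similarity distributions,
    # in both directions, averaged over the teachers in the ensemble.
    s_logits = similarity_logits(student_img, student_txt, temperature)
    total = 0.0
    for t_img, t_txt in teacher_pairs:
        t_logits = similarity_logits(t_img, t_txt, temperature)
        for s, t in ((s_logits, t_logits), (s_logits.T, t_logits.T)):
            t_logp = log_softmax(t)
            kl = (np.exp(t_logp) * (t_logp - log_softmax(s))).sum(axis=-1).mean()
            total += 0.5 * kl
    return total / len(teacher_pairs)


def reinforced_loss(student_img, student_txt, teacher_pairs, lam=0.5, temperature=0.07):
    # Hypothetical mixing weight `lam` between the two objectives.
    clip_term = contrastive_loss(student_img, student_txt, temperature)
    distill_term = distillation_loss(student_img, student_txt, teacher_pairs, temperature)
    return (1.0 - lam) * clip_term + lam * distill_term


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    batch = 4
    student_img = rng.normal(size=(batch, 8))
    student_txt = rng.normal(size=(batch, 8))
    # Two stand-in teachers with larger embedding sizes than the student.
    teachers = [
        (rng.normal(size=(batch, 16)), rng.normal(size=(batch, 16))),
        (rng.normal(size=(batch, 32)), rng.normal(size=(batch, 32))),
    ]
    print(reinforced_loss(student_img, student_txt, teachers))
```

Because the teachers are compared with the student only through batch-by-batch similarity matrices, their embedding dimensions do not need to match the student's. In practice the teacher embeddings could be computed once and stored alongside the dataset, so the teachers would not need to run during student training; whether and how that is done here is not stated in the summary above.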