Cross-Modal Adapter: A Parameter-Efficient Transfer Learning Approach for Improving Vision-Language Models
The proposed XMAdapter builds cache models for both the text and image modalities, then retrieves over this bimodal visual-language information to fuse the two modalities and improve model performance.
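To make the cache-model idea concrete, the sketch below follows the common key-value formulation used by retrieval-based adapters (e.g. Tip-Adapter-style caches): keys are stored features, values are one-hot labels, and a query is classified by a similarity-weighted lookup. The function names, the fusion weight `alpha`, and the sharpness `beta` are illustrative assumptions, not the paper's exact formulation; the only point taken from the text is that two caches (image keys and text keys) are queried and their retrieval results fused.

```python
import numpy as np

def l2_normalize(x, axis=-1):
    """Unit-normalize feature vectors so that dot product = cosine similarity."""
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

def cache_logits(query, keys, values, beta=5.5):
    """Retrieve from one cache: similarity-weighted sum of cached one-hot labels.

    query:  (B, D) normalized query features
    keys:   (N, D) normalized cached features (one modality)
    values: (N, C) one-hot labels of the cached samples
    beta:   sharpness of the exponential activation (an assumed hyperparameter)
    """
    affinity = query @ keys.T                 # (B, N) cosine similarities
    weights = np.exp(-beta * (1.0 - affinity))  # high similarity -> weight near 1
    return weights @ values                   # (B, C) retrieval-based logits

def xmadapter_logits(query, img_keys, txt_keys, values, alpha=0.6):
    """Hypothetical cross-modal fusion: blend image-cache and text-cache retrieval."""
    logits_img = cache_logits(query, img_keys, values)
    logits_txt = cache_logits(query, txt_keys, values)
    return alpha * logits_img + (1.0 - alpha) * logits_txt

# Toy data: N cached samples, C classes, D-dim features, B queries.
rng = np.random.default_rng(0)
N, C, D, B = 8, 4, 16, 2
img_keys = l2_normalize(rng.normal(size=(N, D)))   # cached image features
txt_keys = l2_normalize(rng.normal(size=(N, D)))   # cached text features
values = np.eye(C)[rng.integers(0, C, size=N)]     # one-hot labels
query = l2_normalize(rng.normal(size=(B, D)))      # test image features

logits = xmadapter_logits(query, img_keys, txt_keys, values)
print(logits.shape)
```

In practice such cache logits are typically added to the frozen backbone's zero-shot logits rather than used alone, which is what makes the approach parameter-efficient: only the cache (and optionally a few blending weights) is adapted.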