Improving Cross-Modal Image-Text Retrieval by Enhancing Object Awareness in Vision-Language Models
The proposed method enhances the object awareness of vision-language models to improve cross-modal image-text retrieval, especially for images containing small objects.