Core Concepts
The proposed method enhances the object awareness of vision-language models to improve cross-modal image-text retrieval, especially for images containing small objects.
Summary
The paper proposes an "object-aware query perturbation" (Q-Perturbation) framework to improve the performance of cross-modal image-text retrieval, particularly for images containing small objects.
The key insights are:
- Recent pre-trained vision-language (V&L) models have limited retrieval performance for small objects because their image-text alignment is coarse and misses the fine-grained localization of small targets in the image.
- The proposed Q-Perturbation increases the object awareness of V&L models by focusing on object information of interest, even when the objects in an image are relatively small.
- Q-Perturbation is a training-free, easy-to-implement method that can be plugged into existing V&L models such as BLIP2, COCA, and InternVL, preserving their strong retrieval performance while improving object awareness.
- Comprehensive experiments on public datasets demonstrate the effectiveness of the proposed method, outperforming conventional algorithms, especially for images with small objects.
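The summary above does not spell out how the query perturbation is computed, so the following is only a minimal NumPy sketch of the general idea: nudging a model's query embeddings toward the subspace spanned by detected-object features, without any retraining. The function name `q_perturb`, the tensor shapes, and the SVD-based projection are all illustrative assumptions, not the paper's actual formulation.

```python
import numpy as np

def q_perturb(queries: np.ndarray, obj_feats: np.ndarray, alpha: float = 0.5) -> np.ndarray:
    """Hypothetical sketch of object-aware query perturbation.

    queries:   (num_queries, d) query embeddings of a frozen V&L model.
    obj_feats: (num_objects, d) features of detected object regions.
    alpha:     perturbation strength (illustrative hyperparameter).

    Amplifies each query's component lying in the object-feature
    subspace, so small objects contribute more to retrieval.
    """
    # Orthonormal basis of the object-feature subspace via SVD.
    u, _, _ = np.linalg.svd(obj_feats.T, full_matrices=False)  # u: (d, k)
    # Project queries onto that subspace and amplify the projection.
    proj = queries @ u @ u.T
    return queries + alpha * proj
```

Because the perturbation only re-weights existing embeddings, no parameters are updated, which is consistent with the training-free, plug-in nature of the method described above.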
Statistics
An image belongs to the "small object category" dataset when the ratio of the largest detected object rectangle's area to the entire image's area is less than 10%.
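The 10% area-ratio criterion above is straightforward to express in code. The helper below is a sketch of that check; the `(x1, y1, x2, y2)` box format and the function name are assumptions for illustration.

```python
def is_small_object_image(boxes, img_w, img_h, thresh=0.10):
    """Return True if the largest detected box covers less than
    `thresh` of the image area (the "small object category" rule).

    boxes: list of (x1, y1, x2, y2) detected object rectangles
           (assumed format; the source does not specify one).
    """
    if not boxes:
        return False  # no detections, ratio undefined
    largest = max((x2 - x1) * (y2 - y1) for x1, y1, x2, y2 in boxes)
    return largest / (img_w * img_h) < thresh
```

For example, a single 10x10 detection in a 100x100 image gives a ratio of 1%, well under the 10% threshold.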
Quotes
"The lack of such object-awareness in V&L models is a major issue, especially for human-centered vision tasks, e.g., image retrieval."
"Our Query Perturbation improves the object-awareness of a V&L model while inheriting the impressive performance of a V&L model."