The paper investigates the challenges faced by current state-of-the-art cross-modal image-text models, such as CLIP, in achieving open-world understanding. The key findings are:
The limitations in fine-grained open-vocabulary object detection stem primarily from the CLIP latent space rather than from the object localization phase: CLIP's performance on a fine-grained benchmark mirrors the error patterns observed in open-vocabulary object detectors that rely on CLIP.
Fine-grained information is present in the CLIP latent space, but it is not effectively extracted by the standard cosine similarity matching. The paper demonstrates that simple linear projections on top of the frozen CLIP encoders can significantly improve fine-grained matching performance, while maintaining reasonable coarse-grained retrieval capabilities.
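A minimal sketch of this idea is shown below, assuming a 512-dimensional embedding space (as in CLIP ViT-B/32). The LinearHead module and the placeholder tensors are illustrative, not the paper's code: the frozen CLIP encoders are assumed to produce the embeddings elsewhere (e.g. via encode_image/encode_text), and only the two linear projections would be trained, for instance with a contrastive loss.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class LinearHead(nn.Module):
    """Linear projections trained on top of frozen CLIP embeddings (illustrative)."""
    def __init__(self, dim: int = 512):
        super().__init__()
        self.image_proj = nn.Linear(dim, dim)  # applied to frozen image embeddings
        self.text_proj = nn.Linear(dim, dim)   # applied to frozen text embeddings

    def forward(self, image_emb: torch.Tensor, text_emb: torch.Tensor) -> torch.Tensor:
        # Project, L2-normalize, then match with cosine similarity.
        img = F.normalize(self.image_proj(image_emb), dim=-1)
        txt = F.normalize(self.text_proj(text_emb), dim=-1)
        return img @ txt.t()  # [num_images, num_captions] similarity matrix

# Placeholder embeddings standing in for the outputs of the frozen CLIP encoders.
image_emb = torch.randn(8, 512)
text_emb = torch.randn(8, 512)

head = LinearHead(dim=512)
similarity = head(image_emb, text_emb)
```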
Applying more complex non-linear architectures, such as MLPs and attention layers, provides no clear advantage over the linear projections. This suggests that the fine-grained information is linearly separable within the CLIP latent space, but that the standard cosine similarity alone is insufficient to capture these nuanced differences.
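For contrast, a non-linear head can be sketched by replacing each linear map in the previous example with a small MLP. The hidden width and the mlp_proj helper are hypothetical choices, not the paper's exact architecture; per the finding above, such heads do not clearly outperform the linear ones.

```python
import torch.nn as nn

def mlp_proj(dim: int = 512, hidden: int = 1024) -> nn.Module:
    """Non-linear alternative to the linear projection head (hypothetical sizes)."""
    return nn.Sequential(
        nn.Linear(dim, hidden),
        nn.ReLU(),
        nn.Linear(hidden, dim),
    )

# Swapping mlp_proj(...) in place of nn.Linear(dim, dim) changes only the
# head's capacity; the matching step (cosine similarity between the
# projected embeddings) stays the same.
```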
The findings highlight the need for better pre-training strategies that construct more balanced image-text representations, incorporating both fine-grained and coarse-grained features. The paper also suggests exploring alternative matching functions that can extract fine-grained features from the CLIP latent space without requiring task-specific datasets to learn them.