Limitations of CLIP in Fine-Grained Open-World Perception
The limited fine-grained understanding of open-vocabulary object detectors can be attributed to deficiencies in the CLIP latent space, where standard cosine similarity matching does not effectively capture fine-grained object properties.
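
To make the matching step concrete, the following is a minimal sketch of cosine similarity matching between one image embedding and several text embeddings. The 3-dimensional vectors, the captions, and the helper names (`cosine_similarity`, `match`) are hypothetical and hand-picked for illustration; they are not real CLIP embeddings or measurements. They are built so that two captions differing only in an attribute lie close together, which is the situation described above.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def match(image_embedding, text_embeddings):
    """Score every caption against the image and return the best one."""
    scores = {
        caption: cosine_similarity(image_embedding, embedding)
        for caption, embedding in text_embeddings.items()
    }
    best = max(scores, key=scores.get)
    return best, scores


# Toy vectors (assumption): the two "car" captions differ only slightly
# in the last coordinate, standing in for a fine-grained attribute.
image = [0.9, 0.1, 0.35]
texts = {
    "a red car": [0.9, 0.1, 0.35],
    "a blue car": [0.9, 0.1, 0.25],
    "a dog": [0.1, 0.9, 0.20],
}

best, scores = match(image, texts)
for caption, score in scores.items():
    print(f"{caption}: {score:.4f}")
print("best match:", best)
```

In this toy setup the gap between the two car captions is far smaller than the gap between either car caption and the unrelated one, so a small perturbation of the embeddings could flip the attribute-level ranking while leaving the category-level ranking intact.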