The limitations of open-vocabulary object detectors in fine-grained understanding can be traced to deficiencies in the CLIP latent space: fine-grained object properties are not effectively captured by standard cosine-similarity matching.
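To make this failure mode concrete, the sketch below mimics CLIP-style matching (L2-normalize embeddings, then take dot products) on synthetic vectors. The embeddings, dimensionality, and perturbation sizes are all illustrative assumptions, not real CLIP features; the point is only that when an attribute contributes little to the embedding, two prompts differing in that attribute receive nearly identical similarity scores.

```python
import numpy as np

def cosine_match(image_emb, text_embs):
    """Standard CLIP-style matching: L2-normalize, then dot product."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = text_embs / np.linalg.norm(text_embs, axis=1, keepdims=True)
    return txt @ img  # one cosine similarity per text prompt

# Synthetic stand-ins for CLIP embeddings (illustrative only).
rng = np.random.default_rng(0)
image_emb = rng.normal(size=512)

# Two prompts that differ only in a fine-grained attribute
# (e.g. "a red car" vs. "a blue car"): modeled here as the same
# base embedding plus small, attribute-sized perturbations.
text_embs = np.stack([
    image_emb + 0.05 * rng.normal(size=512),
    image_emb + 0.06 * rng.normal(size=512),
])

scores = cosine_match(image_emb, text_embs)
# Both scores are close to 1 and nearly tied, so cosine similarity
# alone gives almost no signal for separating the two attributes.
```

Under these assumptions the two scores differ by well under one percent, which is the kind of margin that gets lost to noise in a real detector's classification head.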
Current open-vocabulary object detectors struggle to accurately capture and distinguish fine-grained object details such as color, material, pattern, and transparency.
The paper's core contribution is a universal, explicit approach that enhances the fine-grained attribute detection capabilities of mainstream open-vocabulary object detection (OVD) models by highlighting fine-grained attributes in an explicit linear space.