Current open-vocabulary object detectors struggle to capture and distinguish fine-grained object details such as color, material, pattern, and transparency.
The authors propose a novel framework, Retrieval-Augmented Losses and visual Features (RALF), which retrieves related vocabulary and concepts from a large vocabulary set and uses them to augment both the training losses and the visual features, improving the generalizability of open-vocabulary object detectors.
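The retrieval step can be pictured as a nearest-neighbor lookup in an embedding space: given a query concept, find the most similar entries in a large vocabulary. The sketch below is a toy illustration, not the authors' implementation; the vocabulary, embedding dimension, and random embeddings are all hypothetical stand-ins (a real system would use text-encoder embeddings, e.g. from a pretrained vision-language model).

```python
import numpy as np

# Toy stand-ins: in practice these would be text-encoder embeddings
# of a large concept vocabulary; names and sizes are illustrative.
rng = np.random.default_rng(0)
vocabulary = ["glass", "metal", "striped", "red", "wooden", "transparent"]
vocab_embeds = rng.normal(size=(len(vocabulary), 8))
vocab_embeds /= np.linalg.norm(vocab_embeds, axis=1, keepdims=True)

def retrieve_related(query_embed, k=3):
    """Return the k vocabulary entries most similar to the query embedding."""
    q = query_embed / np.linalg.norm(query_embed)
    sims = vocab_embeds @ q          # cosine similarity against each entry
    top = np.argsort(-sims)[:k]      # indices of the k highest similarities
    return [vocabulary[i] for i in top]

# Retrieve the concepts closest to a (random, illustrative) query embedding.
related = retrieve_related(rng.normal(size=8), k=3)
print(related)
```

The retrieved concepts could then be folded into training, e.g. as extra negatives/positives in a classification loss or as additional text features, which is the spirit of the loss and feature augmentation RALF describes.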