Core Concepts
The core message of this paper is to leverage synthetic captions generated by pre-trained vision-language models, together with a novel hyperbolic vision-language learning objective, to boost open-world object detection: the task of detecting and localizing arbitrary objects specified by either class labels or free-form text.
Abstract
The paper proposes HyperLearner, a novel approach to open-world object detection. The key highlights are:
Bootstrapping synthetic captions: The authors propose to leverage pre-trained vision-language models to automatically generate dense synthetic captions on different regions of an image, providing rich open-vocabulary descriptions of both seen and unseen objects.
Hyperbolic vision-language learning: To mitigate the noise caused by hallucination in synthetic captions, the authors introduce a novel hyperbolic vision-language learning objective. This objective aligns the visual and caption embeddings in a hierarchical structure, where the caption embedding entails the visual object embedding, enforcing a meaningful structural hierarchy.
Extensive experiments: The authors evaluate their approach on a wide variety of open-world detection benchmarks, including COCO, LVIS, Object Detection in the Wild (ODinW), and RefCOCO/+/g. The results show that HyperLearner consistently outperforms existing state-of-the-art methods when using the same backbone.
Ablation studies: The authors conduct thorough ablation studies to analyze the contributions of different components, including the hyperbolic learning objective, region sampling for synthetic captions, and the cross-modal attention module.
Qualitative analysis: The authors provide qualitative results demonstrating the strong generalization capability of their model in detecting and localizing novel objects using both class labels and free-form texts.
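To make the caption-bootstrapping step concrete, the sketch below pairs randomly sampled image regions with captions from a pre-trained vision-language model. This is a minimal illustration, not the paper's actual pipeline: the `sample_regions` strategy, its parameters, and the `captioner` callable are all hypothetical stand-ins for whatever region-sampling scheme and captioning model the authors use.

```python
import random

def sample_regions(width, height, num_regions=8, min_frac=0.2, rng=None):
    """Sample random sub-windows (x0, y0, x1, y1) of an image.

    Hypothetical stand-in for the paper's region-sampling strategy.
    """
    rng = rng or random.Random(0)
    boxes = []
    for _ in range(num_regions):
        w = rng.uniform(min_frac, 1.0) * width
        h = rng.uniform(min_frac, 1.0) * height
        x0 = rng.uniform(0.0, width - w)
        y0 = rng.uniform(0.0, height - h)
        boxes.append((x0, y0, x0 + w, y0 + h))
    return boxes

def bootstrap_captions(image, captioner, width, height):
    """Pair each sampled region with a synthetic caption.

    `captioner` stands in for a pre-trained vision-language model that
    takes a cropped region and returns a free-form text description.
    """
    pairs = []
    for box in sample_regions(width, height):
        crop = (image, box)  # placeholder for an actual image crop
        pairs.append((box, captioner(crop)))
    return pairs
```

The resulting (region, caption) pairs would then serve as open-vocabulary supervision covering both seen and unseen objects, as described above.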
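The hyperbolic vision-language objective can be sketched in a Poincaré-ball formulation, a common choice for embedding hierarchies. The code below is a minimal illustration under that assumption, not the paper's exact loss: it projects Euclidean embeddings onto the ball, pulls a (visual, caption) pair together by geodesic distance, and adds a simple norm-ordering penalty so the more general caption embedding sits closer to the origin than the visual object embedding it entails. The `margin` and `lam` weights are illustrative.

```python
import math

def _norm(x):
    return math.sqrt(sum(t * t for t in x))

def exp_map_origin(v, eps=1e-9):
    """Project a Euclidean vector onto the Poincare ball via the
    exponential map at the origin: exp_0(v) = tanh(||v||) * v / ||v||."""
    n = _norm(v)
    if n < eps:
        return [0.0 for _ in v]
    scale = math.tanh(n) / n
    return [scale * t for t in v]

def poincare_distance(u, v, eps=1e-9):
    """Geodesic distance on the Poincare ball:
    d(u, v) = arcosh(1 + 2 * ||u - v||^2 / ((1 - ||u||^2) * (1 - ||v||^2)))."""
    diff2 = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = max((1.0 - _norm(u) ** 2) * (1.0 - _norm(v) ** 2), eps)
    return math.acosh(1.0 + 2.0 * diff2 / denom)

def hyperbolic_alignment_loss(visual, caption, margin=0.1, lam=1.0):
    """Toy entailment-style objective: align the pair in hyperbolic
    space, and penalize caption embeddings that drift farther from the
    origin than the more specific visual embedding they should entail."""
    v = exp_map_origin(visual)
    c = exp_map_origin(caption)
    align = poincare_distance(v, c)
    hierarchy = max(0.0, _norm(c) - _norm(v) + margin)
    return align + lam * hierarchy
```

In this setup, noisy or hallucinated captions incur a bounded structural penalty rather than a hard match constraint, which is one intuition for why a hierarchical objective can be more robust to caption noise.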
Example Synthetic Captions
"a furry black-and-white bear is sitting and eating bamboo"
"a dog is sitting quietly on the grass"
"a traditional temple"