The paper introduces a data-free framework for multi-label image recognition that uses pre-trained Large Language Models (LLMs) to adapt Vision-Language Models (VLMs) such as CLIP. By querying LLMs with targeted questions and learning hierarchical prompts, the method achieves strong results on three benchmark datasets, demonstrating that the comprehensive knowledge held by LLMs can enhance multi-label image recognition without any training data.
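The core mechanism behind CLIP-style multi-label recognition can be illustrated with a minimal sketch. This is not the paper's implementation: the embeddings below are toy stand-ins, whereas a real system would use CLIP's image and text encoders applied to LLM-generated class descriptions. The key difference from single-label zero-shot CLIP is that instead of taking an argmax over classes, every class whose prompt similarity clears a threshold is predicted.

```python
# Hedged sketch (not the paper's code): zero-shot multi-label scoring.
# Embeddings are illustrative stand-ins for CLIP image/text features.
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def multilabel_predict(image_emb, class_embs, threshold=0.5):
    """Return every class whose prompt embedding is similar enough.

    Unlike single-label zero-shot classification (argmax over classes),
    multi-label recognition keeps all classes above a similarity threshold.
    """
    return [name for name, emb in class_embs.items()
            if cosine(image_emb, emb) >= threshold]

# Toy example: image embedding close to "dog" and "frisbee", far from "car".
image_emb = [0.9, 0.8, 0.1]
class_embs = {
    "dog":     [1.0, 0.0, 0.0],
    "frisbee": [0.0, 1.0, 0.0],
    "car":     [0.0, 0.0, 1.0],
}
print(multilabel_predict(image_emb, class_embs))  # -> ['dog', 'frisbee']
```

The threshold-based decision rule is what makes the setup multi-label: any number of classes, including zero, can be predicted for a single image.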
The study also explores synergies between multiple pre-trained models and emphasizes the importance of modeling relationships between object categories during prompt learning. Extensive experiments show consistent improvements over existing methods, including a 4.7% mAP gain over zero-shot approaches on the MS-COCO dataset.
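Since the reported gains are in mAP, it may help to recall how that metric is computed. The sketch below is a standard mean-average-precision calculation, not code from the paper: per class, images are ranked by predicted score and precision is averaged at each positive; mAP is the mean over classes.

```python
# Hedged sketch: mean average precision (mAP), the metric cited for MS-COCO.
def average_precision(scores, labels):
    """AP for one class: scores are predictions, labels are 0/1 ground truth."""
    ranked = sorted(zip(scores, labels), key=lambda p: -p[0])
    hits, precisions = 0, []
    for rank, (_, is_pos) in enumerate(ranked, start=1):
        if is_pos:
            hits += 1
            precisions.append(hits / rank)  # precision at this positive
    return sum(precisions) / max(hits, 1)

def mean_ap(per_class):
    """per_class: list of (scores, labels) pairs, one per category."""
    return sum(average_precision(s, l) for s, l in per_class) / len(per_class)

# Toy example with two classes.
print(mean_ap([
    ([0.9, 0.8, 0.2], [1, 0, 1]),   # AP = (1/1 + 2/3) / 2
    ([0.7, 0.6, 0.5], [0, 1, 1]),   # AP = (1/2 + 2/3) / 2
]))
```

A 4.7% absolute improvement in this metric means that, averaged over all categories, the method's ranked predictions place true positives noticeably higher than the zero-shot baseline's.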
Source: Shuo Yang, Zi... et al., arXiv, 03-05-2024
https://arxiv.org/pdf/2403.01209.pdf