Key Concepts
The authors propose a novel framework for multi-label image recognition that requires no training data, leveraging Large Language Models (LLMs) to adapt Vision-Language Models (VLMs) through prompt tuning.
Abstract
The paper introduces a data-free framework for multi-label image recognition that uses pre-trained Large Language Models (LLMs) to adapt Vision-Language Models (VLMs) such as CLIP. By querying LLMs with targeted questions and learning hierarchical prompts from the answers, the method achieves promising results on three benchmark datasets. The approach demonstrates that the broad knowledge encoded in LLMs can enhance multi-label image recognition without any training data.
The study explores synergies between multiple pre-trained models and emphasizes the importance of modeling relationships between object categories during prompt learning. Extensive experiments show improvements over existing methods, outperforming zero-shot approaches by 4.7% in mAP on the MS-COCO dataset.
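The zero-shot multi-label scoring that such CLIP-based methods build on can be sketched as follows. This is a minimal illustration with random vectors standing in for CLIP's image and text encoders; the LLM queries, hierarchical prompts, and the temperature value are hypothetical stand-ins, not the paper's actual components:

```python
import numpy as np

rng = np.random.default_rng(0)

def l2_normalize(x, axis=-1):
    return x / np.linalg.norm(x, axis=axis, keepdims=True)

# Stand-ins for CLIP encoders: in the paper's setting, the text embeddings
# would come from learned prompts seeded with LLM-generated category knowledge.
dim = 512
class_names = ["person", "dog", "bicycle"]
text_emb = l2_normalize(rng.normal(size=(len(class_names), dim)))  # one row per class
image_emb = l2_normalize(rng.normal(size=dim))

# Multi-label scoring: one cosine similarity per class, with a per-class
# sigmoid rather than a softmax across classes, since several labels can
# be present in the same image.
logits = text_emb @ image_emb                 # shape: (num_classes,)
probs = 1 / (1 + np.exp(-logits / 0.07))      # temperature-scaled sigmoid (illustrative value)
predicted = [c for c, p in zip(class_names, probs) if p > 0.5]
```

The key design choice this illustrates is treating each class as an independent binary decision, which is what distinguishes multi-label recognition from single-label classification.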
Statistics
Extensive experiments on three public datasets (MS-COCO, VOC2007, and NUS-WIDE).
Achieved better results than state-of-the-art methods, outperforming zero-shot multi-label recognition methods by 4.7% in mAP on MS-COCO.
Proposed hierarchical prompt learning method incorporating multi-label dependency.
Utilized knowledge from a pre-trained Large Language Model (LLM) to prompt-tune CLIP.
Introduced a new way to explore synergies between multiple pre-trained models for novel category recognition.
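The mAP figure cited above is the mean of per-class average precision over all categories. A compact reference implementation of the standard ranked-list AP (not the authors' evaluation code) might look like:

```python
def average_precision(scores, labels):
    """AP for one class: `labels` are 0/1 ground truth, `scores` are predictions."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    hits, precisions = 0, []
    for rank, i in enumerate(order, start=1):
        if labels[i] == 1:
            hits += 1
            precisions.append(hits / rank)  # precision at each positive's rank
    return sum(precisions) / max(hits, 1)

def mean_average_precision(score_matrix, label_matrix):
    """mAP: mean of per-class AP, one (scores, labels) pair per class."""
    aps = [average_precision(s, l) for s, l in zip(score_matrix, label_matrix)]
    return sum(aps) / len(aps)
```

For example, a class whose positives are all ranked above its negatives scores AP = 1.0, so a 4.7% mAP gain reflects positives being pushed consistently higher in the per-class rankings.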
Quotes
"We propose a data-free framework for multi-label image recognition without any training data."
"Our method presents a new way to explore the synergies between multiple pre-trained models."
"Our framework introduces a promising avenue for handling new objects in visual recognition."