Key concepts
Large Language Models (LLMs) can provide valuable visual descriptions and knowledge to enhance the performance of pre-trained vision-language models like CLIP in low-shot image classification tasks.
Summary
The paper studies how Large Language Models (LLMs) can be integrated with pre-trained vision-language models such as CLIP to improve low-shot image classification.
Key highlights:
- Low-shot image classification tasks, including few-shot and zero-shot variants, rely heavily on category names as the source of class-specific knowledge, resulting in a shortage of distinguishable descriptions.
- LLMs, trained on large text corpora, can provide rich visual descriptions for fine-grained object categories that can be leveraged to enhance text prompts.
- The authors propose the LLaMP framework, which treats LLMs as prompt learners for the CLIP text encoder, to effectively adapt LLMs for image classification without fully fine-tuning the language model.
- Experiments show that LLaMP outperforms state-of-the-art prompt learning methods on both zero-shot generalization and few-shot image classification across a spectrum of 11 datasets.
- The authors also conduct extensive analysis to investigate the effectiveness of each component of LLaMP and discuss the optimal setup for LLM-aided image classification.
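The idea of enriching text prompts with LLM-provided descriptions can be illustrated with a small sketch. This is not the LLaMP framework itself, which learns prompts through the LLM; it only mimics the simpler training-free idea of appending descriptive noun phrases to class prompts. Everything here is an assumption made for illustration: the function names, the prompt template, the Boeing 747 phrases, and above all `toy_encode`, a bag-of-words stand-in that would be replaced by a real CLIP text encoder in practice.

```python
import numpy as np

# Hypothetical vocabulary for the stand-in encoder below.
VOCAB = ["photo", "yak-40", "boeing", "747", "trijet", "sloping", "nose",
         "engines", "rear", "hump", "four", "wings"]

def toy_encode(text):
    """Stand-in for a text encoder: bag-of-words counts over VOCAB (assumption)."""
    tokens = text.lower().split()
    return np.array([tokens.count(word) for word in VOCAB], dtype=float)

def build_prompts(class_name, noun_phrases, template="a photo of a {}"):
    """Return the plain category prompt plus one prompt per descriptive phrase."""
    base = template.format(class_name)
    return [base] + [f"{base} with {phrase}" for phrase in noun_phrases]

def class_embedding(prompts, encode_text):
    """Average the L2-normalized prompt embeddings into a single unit vector."""
    vecs = np.stack([encode_text(p) for p in prompts])
    vecs = vecs / np.linalg.norm(vecs, axis=1, keepdims=True)
    mean = vecs.mean(axis=0)
    return mean / np.linalg.norm(mean)

def classify(image_vec, class_prompts, encode_text):
    """Pick the class whose prompt embedding has the highest cosine similarity."""
    names = list(class_prompts)
    weights = np.stack([class_embedding(class_prompts[n], encode_text) for n in names])
    image_vec = image_vec / np.linalg.norm(image_vec)
    scores = weights @ image_vec
    return names[int(np.argmax(scores))], dict(zip(names, scores.tolist()))

# Noun phrases of the kind an LLM might return (illustrative, not model output).
descriptions = {
    "Yak-40": ["trijet configuration", "sloping nose",
               "three engines mounted on the rear"],
    "Boeing 747": ["upper-deck hump", "four engines under the wings"],
}
class_prompts = {name: build_prompts(name, phrases)
                 for name, phrases in descriptions.items()}

# Toy "image" feature that only carries visual attributes, never the class name.
image_vec = toy_encode("trijet sloping nose")
label, scores = classify(image_vec, class_prompts, toy_encode)
```

With category names alone, the toy image feature shares nothing with either class prompt, so the two classes cannot be told apart. Once the descriptive phrases are added, the matching attributes give the correct class a higher similarity, which is the intuition behind using LLM knowledge to make prompts more distinguishable.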
Statistics
"The Yak-40 has a unique trijet configuration with a large passenger window section and a sloping nose, along with three engines mounted on the rear of the aircraft, creating an unmistakable silhouette in the sky."
"By simply incorporating noun phrases extracted from a LLM's response, the performance of the ordinary CLIP models is improved by more than 1% without any training."
Quotes
"To the best of our knowledge, we are the first to investigate how to use the encyclopedic knowledge inherent in Large Language Models (LLMs) to enhance low-shot image classification."
"We design a framework, LLaMP, to effectively adapt LLMs for image classification, without training the entire language model, and achieve state-of-the-art in both few-shot and zero-shot settings."