Core Concepts
Multimodal large language models (LLMs) can significantly improve zero-shot image classification accuracy by generating rich textual representations of images that complement visual features.
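To make the idea concrete, here is a minimal scoring sketch in PyTorch. It fuses two cosine-similarity cues, image-to-class and description-to-class; the function name `zero_shot_scores`, the weighted-average fusion with `alpha`, and the random toy embeddings are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn.functional as F

def zero_shot_scores(image_emb: torch.Tensor,
                     desc_emb: torch.Tensor,
                     class_embs: torch.Tensor,
                     alpha: float = 0.5) -> torch.Tensor:
    """Score each class by fusing image-to-class and description-to-class
    cosine similarities. The weighted average with `alpha` is a hypothetical
    fusion rule; the paper's exact combination may differ."""
    image_emb = F.normalize(image_emb, dim=-1)
    desc_emb = F.normalize(desc_emb, dim=-1)
    class_embs = F.normalize(class_embs, dim=-1)
    img_sim = class_embs @ image_emb    # image vs. class-name embeddings
    desc_sim = class_embs @ desc_emb    # LLM description vs. class-name embeddings
    return alpha * img_sim + (1 - alpha) * desc_sim

# Toy usage with random 768-d embeddings (CLIP ViT-L/14 dimensionality)
image_emb = torch.randn(768)
class_embs = torch.randn(1000, 768)     # e.g. the 1000 ImageNet classes
desc_emb = torch.randn(768)             # embedding of the LLM's description
pred = zero_shot_scores(image_emb, desc_emb, class_embs).argmax().item()
print(pred)
```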
Statistics
The proposed method achieves an average accuracy gain of 4.1 percentage points across ten image-classification benchmark datasets.
The method achieves an accuracy increase of 6.8% on the ImageNet dataset.
Gemini Pro was used as the multimodal LLM for generating image descriptions and initial predictions.
CLIP (ViT-L/14) was used as the cross-modal embedding encoder (see the embedding sketch after this list).
The study used ten benchmark datasets: ImageNet, Pets, Places365, Food-101, SUN397, Stanford Cars, Describable Textures Dataset (DTD), Caltech-101, CIFAR-10, and CIFAR-100.
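For concreteness, here is one plausible way to obtain the embeddings with the models named above, using the Hugging Face `transformers` CLIP API. The checkpoint name, the image path, the toy label set, and the hard-coded description (standing in for Gemini Pro's output) are all assumptions for illustration.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# CLIP ViT-L/14 as the cross-modal encoder; the Hugging Face
# checkpoint name below is an assumption about which release was used.
model = CLIPModel.from_pretrained("openai/clip-vit-large-patch14").eval()
processor = CLIPProcessor.from_pretrained("openai/clip-vit-large-patch14")

image = Image.open("example.jpg")                 # hypothetical input image
class_names = ["tabby cat", "golden retriever"]   # toy label set
# Hard-coded stand-in for the multimodal LLM's output: in the paper,
# Gemini Pro generates this description from the image itself.
description = "a striped cat curled up on a sunny windowsill"

with torch.no_grad():
    image_emb = model.get_image_features(
        **processor(images=image, return_tensors="pt"))
    text_embs = model.get_text_features(
        **processor(text=[description] + class_names,
                    return_tensors="pt", padding=True))

# These tensors plug directly into the zero_shot_scores sketch above.
desc_emb, class_embs = text_embs[0], text_embs[1:]
image_emb = image_emb[0]
```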
Quotes
"To address this, we propose a novel method that leverages the capabilities of multimodal LLMs to generate rich textual representations of the input images."
"Our method offers several key advantages: it significantly improves classification accuracy by incorporating richer textual information extracted directly from the input images; it employs a simple and universal set of prompts, eliminating the need for dataset-specific prompt engineering; and it outperforms existing methods on a variety of benchmark datasets."