Core Concepts
The paper improves the zero-shot classification performance of vision-language models (VLMs) across fine-grained domains by combining two complementary sources of information: category descriptions generated by large language models (LLMs) and abundant fine-grained image classification datasets.
Summary
The paper presents a method for improving the zero-shot performance of vision-language models (VLMs) such as CLIP. It combines two complementary sources of information: category descriptions generated by large language models (LLMs) and abundant fine-grained image classification datasets.
Key highlights:
- Existing VLMs show poor performance in encoding visual attributes in fine-grained domains, beyond simply recognizing the name of the category.
- The authors develop methods to train VLMs with "bag-level" supervision, where a set of images is grouped with a set of descriptions without direct image-text correspondences.
- The authors systematically evaluate the effectiveness of their method by assessing the zero-shot classification performance on novel classes across 12 datasets, including fine-grained domains like iNaturalist and NABirds.
- The authors find that simply using LLM-generated attributes of novel classes does not improve performance, but their training strategy leads to an average improvement of 4-5% in accuracy.
- The authors explore prompting LLMs in various ways to generate descriptions that capture visual appearance, habitat, and geographic regions, and find that geographic priors are equally effective and complementary to visual appearance cues.
- The authors show that their method outperforms prior work on prompt-based tuning of VLMs and also improves performance on the challenging NeWT dataset for tasks beyond categorization.
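The "bag-level" supervision described above can be illustrated with a small sketch. It assumes one plausible reading, which is not necessarily the authors' loss: each class is represented by the mean of its description embeddings, and an image is scored against every class with a softmax cross-entropy. The function names, the temperature value, and the toy two-dimensional embeddings are hypothetical; in practice the vectors would come from a VLM such as CLIP.

```python
import math

def normalize(v):
    """Scale a vector to unit length."""
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def class_prototype(description_embeddings):
    """Average the unit-normalized description embeddings of one class.

    Assumption: the bag of descriptions is pooled by a simple mean.
    """
    unit = [normalize(v) for v in description_embeddings]
    dim = len(unit[0])
    mean = [sum(v[i] for v in unit) / len(unit) for i in range(dim)]
    return normalize(mean)

def class_scores(image_embedding, description_bags):
    """Cosine similarity between one image and each class prototype."""
    img = normalize(image_embedding)
    return [
        sum(a * b for a, b in zip(img, class_prototype(bag)))
        for bag in description_bags
    ]

def bag_level_loss(image_embedding, description_bags, true_class, temperature=0.07):
    """Softmax cross-entropy over classes; no image-text pairing is needed."""
    logits = [s / temperature for s in class_scores(image_embedding, description_bags)]
    m = max(logits)  # subtract the max for numerical stability
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    return log_z - logits[true_class]

def zero_shot_predict(image_embedding, description_bags):
    """Return the index of the class whose descriptions best match the image."""
    scores = class_scores(image_embedding, description_bags)
    return max(range(len(scores)), key=scores.__getitem__)

# Toy example: two classes, each with a bag of two description embeddings.
bags = [
    [[1.0, 0.0], [0.9, 0.1]],  # class 0
    [[0.0, 1.0], [0.1, 0.9]],  # class 1
]
image = [1.0, 0.05]
predicted = zero_shot_predict(image, bags)
loss_correct = bag_level_loss(image, bags, true_class=0)
loss_wrong = bag_level_loss(image, bags, true_class=1)
```

Because the label only ties an image to a class-level set of descriptions, the same functions serve for novel classes at test time: new description bags are swapped in and `zero_shot_predict` is called without retraining.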
Statistics
"a medium-sized, stocky sparrow with a rounded head and a short stout beak"
"features brown, streaky back and wings, with white or light underparts that also have defined streaks"
"notable white outer tail feathers visible in flight, a white eye-ring, and a distinct dark shoulder patch"
"prefers open fields, grasslands, and woodland edges"
"birds have feathers, toothless beaks of varied shapes; wings, a common trait even among non-fliers; a streamlined body with an upright, two-legged stance; and eyes on the sides of their heads for wide vision"
Quotes
"a photo of a Vesper Sparrow with a small conical beak, brown heavily streaked body, long tail and white eye-ring around its black eye"
"a photo of a Hawk T1 aircraft with a relatively short and stubby fuselage"
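The two quoted prompts share a visible pattern: a class name followed by an LLM-generated description. A minimal sketch of building such prompts is shown here, assuming the template "a photo of a {class} with {description}" inferred from the quotes; the helper names are hypothetical and the paper's actual templates may differ.

```python
def build_prompt(class_name, description):
    """Combine a class name and one description into a text prompt.

    Assumption: the template is inferred from the quoted examples.
    """
    return f"a photo of a {class_name} with {description}"

def build_prompt_bag(class_name, descriptions):
    """Build one prompt per description, forming the text bag for a class."""
    return [build_prompt(class_name, d) for d in descriptions]

# Reproduce the two quoted prompts.
sparrow_prompt = build_prompt(
    "Vesper Sparrow",
    "a small conical beak, brown heavily streaked body, "
    "long tail and white eye-ring around its black eye",
)
aircraft_prompt = build_prompt(
    "Hawk T1 aircraft", "a relatively short and stubby fuselage"
)
```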