Language supervision and diverse training data play a crucial role in enhancing CLIP's compositional generalization abilities.
Despite advancements in other areas, current vision-language models (VLMs) exhibit a critical weakness in spatial reasoning, mirroring deficits observed in humans with constructive apraxia, a cognitive disorder.
Current vision-language models (VLMs) lack the ability to effectively filter out irrelevant visual information when presented with multiple images, hindering their performance on tasks requiring long-context reasoning.
Vision-language models (VLMs) can self-improve without external feedback by learning from self-generated self-correction data, enabling them to directly produce more accurate responses.
TextHawk2 is a novel bilingual vision-language model that excels in OCR, grounding, and general multimodal understanding tasks while using significantly fewer image tokens compared to previous models.
This research introduces ELVA (Efficient Language and Vision Assistant), a suite of Vision-Language Models (VLMs) designed to achieve high performance on visually-situated Natural Language Understanding (NLU) tasks while minimizing inference cost, with a particular focus on efficiently handling high-resolution, text-rich images.
AWT, a novel adaptation framework, enhances the performance of pre-trained vision-language models (VLMs) in zero-shot and few-shot image classification tasks by augmenting inputs with diverse visual and textual information, dynamically weighting their importance, and leveraging optimal transport to capture cross-modal correlations.
AWT is a novel training-free framework that strengthens the adaptation ability of pre-trained vision-language models (VLMs): it augments inputs with diverse visual views and rich class descriptions via image transformations and large language models, dynamically weights the inputs based on prediction entropy, and employs optimal transport to mine semantic correlations in the vision-language space.
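For intuition, here is a minimal runnable sketch of the two AWT ingredients named above, entropy-based weighting of augmented inputs and entropy-regularized optimal transport as the cross-modal score. The random feature vectors, the temperature of 100, and the exact weighting choices are stand-ins for illustration, not the authors' implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

def entropy_weights(logits):
    """Softmax each row, compute its entropy, and map low entropy (high confidence)
    to a large weight; weights are normalized to sum to one."""
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)
    ent = -(p * np.log(p + 1e-12)).sum(axis=1)
    w = np.exp(-ent)
    return w / w.sum()

def sinkhorn_cost(a, b, cost, reg=0.1, n_iter=200):
    """Entropy-regularized optimal transport (Sinkhorn) between two weighted feature
    sets; the resulting transport cost serves as a cross-modal distance."""
    K = np.exp(-cost / reg)
    u = np.ones_like(a)
    for _ in range(n_iter):
        v = b / (K.T @ u)
        u = a / (K @ v)
    plan = u[:, None] * K * v[None, :]
    return float((plan * cost).sum())

# Stand-in features: 8 augmented image views and, per class, 5 LLM-style descriptions
# (in AWT these would come from a frozen CLIP-like encoder).
img = rng.normal(size=(8, 512));     img /= np.linalg.norm(img, axis=1, keepdims=True)
txt = rng.normal(size=(10, 5, 512)); txt /= np.linalg.norm(txt, axis=-1, keepdims=True)

# View weights: entropy of each view's class distribution (mean similarity over descriptions).
class_logits = 100.0 * (img @ txt.reshape(-1, 512).T).reshape(8, 10, 5).mean(axis=2)
a = entropy_weights(class_logits)

scores = []
for c in range(10):
    sim = img @ txt[c].T                    # (views, descriptions) cosine similarities
    b = entropy_weights(100.0 * sim.T)      # description weights (simplified choice)
    scores.append(-sinkhorn_cost(a, b, cost=1.0 - sim))
print("predicted class:", int(np.argmax(scores)))
```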
While Vision-Language Models (VLMs) excel in tasks like image retrieval and VQA, they struggle with mathematical reasoning; this research finds that task-specific prompting, rather than captioning, is more effective in improving VLM performance for such tasks.
GLOV leverages large language models (LLMs) as implicit optimizers to discover highly effective prompts for vision-language models (VLMs), significantly improving performance on downstream vision tasks like image classification without requiring gradient-based learning.
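A minimal sketch of the gradient-free loop this implies: an LLM proposes new prompt candidates conditioned on the best (and worst) prompts found so far, and the VLM's downstream accuracy is used to rank them. Both helpers below are hypothetical stubs so the example runs standalone; they are not GLOV's actual meta-prompt or evaluator.

```python
import random

random.seed(0)

def llm_propose(good: list[str], bad: list[str], n: int = 4) -> list[str]:
    """Stand-in for the LLM step. A real implementation would place `good`/`bad`
    prompts into a meta-prompt and sample the LLM for rewrites."""
    stems = ["a photo of a {}", "a cropped photo of the {}", "an image showing a {}",
             "a close-up photo of a {}", "a bright photo of a {}"]
    return random.sample(stems, n)

def vlm_score(prompt: str) -> float:
    """Stand-in for zero-shot evaluation; returns a toy score instead of measuring
    the accuracy of a CLIP-like VLM on a small labeled set."""
    return sum(ord(c) for c in prompt) % 100 / 100.0

# Gradient-free search: keep a pool of scored prompts and let the LLM refine it.
pool = {"a photo of a {}": vlm_score("a photo of a {}")}
for step in range(5):
    ranked = sorted(pool, key=pool.get, reverse=True)
    good, bad = ranked[:2], ranked[-2:]
    for cand in llm_propose(good, bad):
        pool.setdefault(cand, vlm_score(cand))   # evaluate each new candidate once
best = max(pool, key=pool.get)
print("best prompt template:", best, "score:", pool[best])
```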