Core Concepts
Advancing zero-shot learning in multi-modal language models through autonomous instruction optimization.
Abstract
VisLingInstruct introduces a novel approach to improving Multi-Modal Language Models (MMLMs) by autonomously evaluating and optimizing instructional texts. The method strengthens the synergy between visual perception and linguistic expression, significantly boosting zero-shot performance on visual multi-modal tasks. By refining the architecture of existing models and their fine-tuning mechanisms, VisLingInstruct streamlines both training and inference. The framework uses In-Context Learning (ICL) to compare candidate instructions and select the more effective, contextually appropriate one. Comprehensive experiments demonstrate substantial improvements over prior state-of-the-art results across multiple datasets.
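The instruction comparison step can be pictured with a short sketch. The code below is a minimal illustration, not the paper's implementation: the `complete` callable, the exemplar texts, and the prompt wording are all assumptions standing in for whatever generation backend and templates VisLingInstruct actually uses.

```python
from typing import Callable

# `complete` stands in for any text-generation backend (a local model's
# generate() call, an API request, etc.). It is an assumption for this
# sketch, not part of VisLingInstruct itself.
Complete = Callable[[str], str]

# Hypothetical few-shot exemplars showing the judge how to compare
# two candidate instructions for the same visual task.
EXEMPLARS = """\
Instruction A: Describe image.
Instruction B: Describe the main objects in the image and their spatial relations.
Better instruction: B

Instruction A: Read all visible text in the image and transcribe it exactly.
Instruction B: What text?
Better instruction: A
"""


def compare_instructions(complete: Complete, cand_a: str, cand_b: str) -> str:
    """Ask the model, via in-context comparison, which candidate
    instruction is more effective, and return the winner."""
    prompt = (
        "Judge which instruction is clearer and more effective "
        "for a visual question-answering model.\n\n"
        f"{EXEMPLARS}\n"
        f"Instruction A: {cand_a}\n"
        f"Instruction B: {cand_b}\n"
        "Better instruction:"
    )
    verdict = complete(prompt).strip().upper()
    # Default to the rewrite (B) unless the judge clearly prefers A.
    return cand_a if verdict.startswith("A") else cand_b


def optimize_instruction(complete: Complete, instruction: str, rounds: int = 2) -> str:
    """Iteratively rewrite an instruction and keep whichever version
    the comparison step prefers."""
    current = instruction
    for _ in range(rounds):
        rewrite = complete(
            "Rewrite this instruction so it is more specific and "
            f"easier for a multi-modal model to follow:\n{current}"
        ).strip()
        current = compare_instructions(complete, current, rewrite)
    return current
```

Plugging in a real backend amounts to wrapping an MMLM's text-generation call in `complete`; the loop then alternates between proposing a rewrite and letting the ICL comparison decide whether to keep it.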
Stats
VisLingInstruct improves accuracy on the TextVQA dataset by 13.1% over the prior state of the art.
It likewise improves accuracy on the HatefulMemes dataset by 9%.