A Unified Vision-Language Transformer Model with Enhanced Visual Grounding and Generalization Capabilities
We present ViLaM, a unified vision-language transformer model that integrates instruction tuning based on large language models, making full use of their knowledge and reasoning capacities across a variety of vision and language tasks. ViLaM further employs cycle training of referring expressions to reduce the dependence on high-quality, paired referring expression datasets.
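The cycle-training idea pairs referring expression generation (region to text) with referring expression comprehension (text to region) so that each direction supervises the other, relaxing the need for fully annotated pairs. Below is a minimal PyTorch-style sketch of one such cycle-consistency step; the `generate_expression` and `ground_expression` methods are hypothetical stand-ins for the model's two heads, not ViLaM's actual API.

```python
import torch
import torch.nn.functional as F

def cycle_training_step(model, image: torch.Tensor, gt_box: torch.Tensor) -> torch.Tensor:
    """One cycle-consistency step: region -> expression -> region.

    Assumes a `model` exposing two (hypothetical) heads:
      - generate_expression(image, box): produces a referring expression
        describing the given region (the REG direction).
      - ground_expression(image, expression): predicts a bounding box
        for the given expression (the REC direction).
    """
    # Generation pass: describe the ground-truth region in language.
    expression = model.generate_expression(image, gt_box)

    # Comprehension pass: localize the generated expression back in the image.
    pred_box = model.ground_expression(image, expression)

    # Cycle-consistency loss: the predicted box should recover the region
    # the expression was generated from, so the two heads co-train even
    # without paired (expression, box) annotations.
    return F.l1_loss(pred_box, gt_box)
```

The design choice here is that only region annotations are needed for this loop; the referring expressions themselves are produced by the model, which is what addresses the scarcity of paired referring expression data.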