Enhancing Multimodal Large Language Models with Set-of-Mark Prompting: A New Data Source and Learning Paradigm
This work introduces a new learning paradigm, "list items one by one," that trains multimodal large language models (MLLMs) to use Set-of-Mark (SoM) visual prompting and significantly improves their visual reasoning capabilities.
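
As a rough illustration of what a "list items one by one" training sample might look like, the Python sketch below assumes each image carries numeric SoM tags with a short text label per tagged object, and that the target text enumerates the tags in ascending order. The names `TaggedItem`, `build_listing_target`, and `build_sample`, the prompt wording, and the output format are all hypothetical and are not taken from the original work.

```python
from dataclasses import dataclass


@dataclass
class TaggedItem:
    tag: int    # numeric mark overlaid on the image (assumed to be an integer)
    name: str   # short text label for the marked object


def build_listing_target(items):
    """Return the target text: one line per tag, in ascending tag order."""
    ordered = sorted(items, key=lambda item: item.tag)
    return "\n".join(f"{item.tag}. {item.name}" for item in ordered)


def build_sample(image_path, items, prompt="List the items one by one."):
    """Assemble one hypothetical training sample as a plain dictionary."""
    return {
        "image": image_path,          # path to the image with tags drawn on it
        "prompt": prompt,             # assumed instruction wording
        "target": build_listing_target(items),
    }


if __name__ == "__main__":
    items = [TaggedItem(2, "dog"), TaggedItem(1, "person"), TaggedItem(3, "bench")]
    sample = build_sample("tagged_image.jpg", items)
    print(sample["target"])
```

Sorting by tag before formatting keeps the target independent of the order in which annotations were collected, so the model always sees the marks enumerated in the same sequence.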