Image First or Text First? How Modality Sequencing in Multi-Modal Prompts Affects Reasoning Performance of Large Language Models
The order in which images and text appear in a multi-modal prompt significantly influences the reasoning performance of large language models (LLMs), particularly on simpler tasks. This highlights the importance of aligning the modality sequence with the flow of reasoning and suggests room for optimizing multi-modal prompt design.
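
To make "modality sequencing" concrete, the following is a minimal sketch of how the two orderings of a single prompt might be constructed. It assumes a generic list-of-parts prompt representation; the names `build_prompt` and `modality_order`, the part dictionaries, and the sample image reference and question are all hypothetical illustrations and are not taken from the paper or from any particular model API.

```python
from typing import Dict, List

# Hypothetical representation: one prompt is an ordered list of content parts.
Part = Dict[str, str]


def build_prompt(image_ref: str, text: str, image_first: bool) -> List[Part]:
    """Return the same image and text as an ordered list of parts.

    Only the order differs between the two variants, so any change in
    model behavior can be attributed to modality sequencing alone.
    """
    image_part: Part = {"type": "image", "ref": image_ref}
    text_part: Part = {"type": "text", "text": text}
    if image_first:
        return [image_part, text_part]
    return [text_part, image_part]


def modality_order(parts: List[Part]) -> List[str]:
    """Return the sequence of modality types, e.g. ["image", "text"]."""
    return [part["type"] for part in parts]


# Illustrative inputs (assumed, not from the source).
question = "What is the area of the shaded region?"
image_first_prompt = build_prompt("diagram.png", question, image_first=True)
text_first_prompt = build_prompt("diagram.png", question, image_first=False)
```

Under these assumptions, both variants would be sent to the same model and their answers scored on the same task, so that the only difference between the two conditions is the order of the parts.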