This work introduces MM-PlanLLM, a multimodal architecture that enables large language models to comprehend and guide users through complex procedural plans by leveraging both textual and visual information.
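As a rough illustration of the kind of interface such a plan-guidance system exposes, the sketch below pairs the textual plan and dialogue history with an image of the user's current progress and assembles a multimodal prompt. The data classes, field names, and content-part format here are hypothetical conveniences for illustration; MM-PlanLLM's actual input encoding is not specified in this passage.

```python
from dataclasses import dataclass

@dataclass
class PlanStep:
    index: int
    text: str                       # textual instruction for this step

@dataclass
class DialogueTurn:
    role: str                       # "user" or "assistant"
    text: str
    image_path: str | None = None   # optional user-supplied photo of progress

def build_prompt(plan: list[PlanStep],
                 history: list[DialogueTurn],
                 current_step: int) -> list[dict]:
    """Assemble a multimodal prompt as interleaved text/image content parts.

    Hypothetical format: the full textual plan first, then the dialogue so
    far (with any user images), then a question about the current step.
    """
    parts = [{"type": "text",
              "text": "Plan:\n" + "\n".join(f"{s.index}. {s.text}" for s in plan)}]
    for turn in history:
        parts.append({"type": "text", "text": f"{turn.role}: {turn.text}"})
        if turn.image_path:
            parts.append({"type": "image", "path": turn.image_path})
    parts.append({"type": "text",
                  "text": f"The user is on step {current_step}. What should they do next?"})
    return parts
```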
Multimodal Large Language Models (MLLMs) have emerged as a promising approach toward artificial general intelligence by combining the reasoning power of large language models with perception of other modalities such as images. This survey provides a comprehensive overview of recent progress in MLLMs, covering key techniques such as Multimodal Instruction Tuning, Multimodal In-Context Learning, Multimodal Chain of Thought, and LLM-Aided Visual Reasoning.
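To make the first of these techniques concrete, the snippet below shows what a single multimodal instruction-tuning example commonly looks like: an image placeholder token interleaved with an instruction and a target response. The schema follows the widely used LLaVA-style convention rather than any one system described in the survey, and the file path is purely illustrative.

```python
# A single multimodal instruction-tuning sample in a common JSON-style layout.
# The exact schema varies by project; the "<image>" placeholder in the user
# turn marks where visual features are spliced into the token sequence.
sample = {
    "image": "coco/train2017/000000123456.jpg",  # illustrative path
    "conversations": [
        {"from": "human",
         "value": "<image>\nDescribe the main object in this photo and its purpose."},
        {"from": "gpt",
         "value": "The photo shows a stand mixer, an appliance used to mix dough and batters."},
    ],
}

# During training, the "<image>" token is typically replaced by features from
# a frozen image encoder (e.g., a CLIP ViT) projected into the LLM's embedding
# space, and the loss is computed only on the assistant's response tokens.
```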