Multimodal Large Language Model for Interleaved Text-Image Generation
M2Chat is a unified multimodal LLM framework for seamless interleaved text-image generation across diverse scenarios. It integrates low-level visual information with high-level semantic features through a Multimodal Multi-level Adapter (M3Adapter), and it is trained with a two-stage Multimodal Mixed Fine-Tuning (M3FT) strategy.
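The abstract does not specify how the M3Adapter combines the two feature streams. As a purely illustrative sketch (not the paper's actual method), a minimal "multi-level" fusion can be written as a learnable gated blend of projected low-level and high-level features; the function name `fuse`, the projection matrices, and the scalar gate `alpha` are all hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def fuse(low, high, W_low, W_high, alpha):
    """Hypothetical gated fusion of low-level visual and
    high-level semantic features into a shared output space."""
    g = 1.0 / (1.0 + np.exp(-alpha))          # sigmoid gate in (0, 1)
    return g * (low @ W_low) + (1.0 - g) * (high @ W_high)

# Toy shapes: 4 tokens; 16-dim low-level, 32-dim high-level, 8-dim fused output.
low = rng.standard_normal((4, 16))
high = rng.standard_normal((4, 32))
W_low = rng.standard_normal((16, 8)) * 0.1
W_high = rng.standard_normal((32, 8)) * 0.1

fused = fuse(low, high, W_low, W_high, alpha=0.0)  # alpha=0 -> equal blend
print(fused.shape)  # (4, 8)
```

In an actual adapter the gate would typically be a learned parameter (or predicted per token) so the model can trade off pixel-level detail against semantic content during generation; this sketch only shows the shape of that idea.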