Multimodal Large Language Models: Resolving the Conflict Between Visual Comprehension and Generation
Morph-tokens transform pre-MLLM visual tokens into non-conflicting post-MLLM visual tokens, enabling multimodal large language models (MLLMs) to achieve synergy between visual comprehension and visual generation.
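The pre-/post-MLLM distinction can be illustrated with a toy sketch. It is not the authors' implementation: `abstract`, `toy_mllm`, and `mse` are hypothetical helpers, the "MLLM" is a plain linear map, and the assumption that the generation loss is attached to the post-MLLM tokens (rather than to the input tokens) is an interpretation of the summary above.

```python
import random

def abstract(features, keep):
    """Toy pre-MLLM encoder (hypothetical): keep the first `keep` dims, zero the rest."""
    return [f[:keep] + [0.0] * (len(f) - keep) for f in features]

def toy_mllm(pre_tokens, weights):
    """Hypothetical stand-in for the MLLM: one linear map applied to every token."""
    return [[sum(w * x for w, x in zip(row, tok)) for row in weights]
            for tok in pre_tokens]

def mse(a, b):
    """Mean squared error between two equally shaped lists of vectors."""
    n = sum(len(row) for row in a)
    return sum((x - y) ** 2 for ra, rb in zip(a, b) for x, y in zip(ra, rb)) / n

random.seed(0)
dim, num_tokens = 4, 3
image_features = [[random.random() for _ in range(dim)] for _ in range(num_tokens)]
identity = [[1.0 if i == j else 0.0 for j in range(dim)] for i in range(dim)]

# Pre-MLLM tokens: a lossy, abstract view of the image, used as the visual prompt.
pre_tokens = abstract(image_features, keep=2)

# Post-MLLM tokens: what the model emits; same shape, but a separate set of values.
post_tokens = toy_mllm(pre_tokens, identity)

# Assumed objective: the generation loss compares post-MLLM tokens to the full
# image features, so the pre-MLLM tokens are never forced to keep every detail.
generation_loss = mse(post_tokens, image_features)
```

With the untrained identity weights used here, the post-MLLM tokens simply copy the abstract input, so the generation loss is non-zero; a trained model would be expected to recover the dropped detail on the output side while the input side stays abstract.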