The paper proposes a novel approach called "Morph-Tokens" to resolve the conflict between the training objectives of visual comprehension and visual generation in multimodal large language models (MLLMs).
The key idea is to let the same visual token positions "morph" across the model: the pre-MLLM visual tokens carry abstract semantics and serve as visual prompts for comprehension, while the post-MLLM visual tokens are visually complete and are used for image generation. This transformation lets the model handle both comprehension and generation without forcing one representation to serve two competing objectives.
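A minimal PyTorch sketch of this flow may help make the idea concrete. Note that all module and parameter names below (`MorphTokenFlow`, `abstractor`, `to_image_tokens`, `n_visual`) are illustrative stand-ins, not the paper's actual architecture; a real system would use a full LLM backbone and a learned visual tokenizer.

```python
import torch
import torch.nn as nn

class MorphTokenFlow(nn.Module):
    """Illustrative sketch: the SAME token positions are abstract
    semantics before the MLLM (pre-MLLM, for comprehension) and are
    morphed into visually complete tokens after it (post-MLLM, for
    generation). Names here are hypothetical, not from the paper."""

    def __init__(self, d_model: int = 512, n_visual: int = 32):
        super().__init__()
        # Stand-in for the visual tokenizer that abstracts an image
        # into a few high-level pre-MLLM tokens (visual prompts).
        self.abstractor = nn.Linear(d_model, d_model)
        # Stand-in for the MLLM backbone (a real system uses a full LLM).
        self.mllm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        # Projects post-MLLM tokens into the space an image decoder expects.
        self.to_image_tokens = nn.Linear(d_model, d_model)
        self.n_visual = n_visual

    def forward(self, image_feats: torch.Tensor, text_embeds: torch.Tensor):
        # Pre-MLLM morph-tokens: lossy, abstract semantics of the image.
        pre_tokens = self.abstractor(image_feats[:, : self.n_visual])
        # Comprehension and generation share one forward pass.
        hidden = self.mllm(torch.cat([pre_tokens, text_embeds], dim=1))
        # Post-MLLM morph-tokens: visually complete, handed to an image
        # decoder (e.g., a VQ decoder) for reconstruction.
        post_tokens = self.to_image_tokens(hidden[:, : self.n_visual])
        return hidden, post_tokens
```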
The authors introduce a 3-stage training strategy that uses morph-tokens to detach the textual loss from the image reconstruction loss. In the first stage, the token vocabulary of a pre-trained language model is extended with visual tokens, turning it into an MLLM. The second stage auto-encodes morph-tokens: the pre-MLLM tokens act as visual prompts for comprehension, while the post-MLLM tokens are supervised by image reconstruction. The final stage further enhances the model's capabilities through instruction tuning on diverse vision-language tasks.
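To illustrate how the two losses stay detached in the second stage, here is a hedged sketch under the assumptions of the `MorphTokenFlow` stub above; `vocab_head` and the target tensors are hypothetical, and a real system would predict VQ codebook indices rather than regress features.

```python
import torch.nn.functional as F

def stage2_losses(model, vocab_head, image_feats, text_embeds,
                  text_targets, image_code_targets):
    """Sketch of the loss split: the text loss reads hidden states
    conditioned on the ABSTRACT pre-MLLM tokens, while the image
    reconstruction loss touches ONLY the post-MLLM morph-tokens, so
    comprehension never has to carry pixel-level detail and
    generation never loses it."""
    hidden, post_tokens = model(image_feats, text_embeds)

    # Comprehension objective: next-token prediction on the text span.
    text_logits = vocab_head(hidden[:, model.n_visual:])
    loss_text = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
    )

    # Generation objective: reconstruct the image's codes from the
    # post-MLLM tokens only; the gradient of this loss never demands
    # visual completeness from the pre-MLLM prompt tokens.
    loss_image = F.mse_loss(post_tokens, image_code_targets)
    return loss_text + loss_image
```

Because each objective supervises a different end of the morph transformation, neither loss pulls the shared visual tokens toward a representation that hurts the other task.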
Extensive experiments demonstrate that the proposed morph-token-based MLLM outperforms existing MLLMs on a wide range of multimodal comprehension and generation benchmarks. It also exhibits emergent abilities, such as preserving image fidelity across multi-turn image editing and performing advanced multimodal in-context learning.
Key insights extracted from the paper by Kaihang Pan, ... (arxiv.org, 05-06-2024): https://arxiv.org/pdf/2405.01926.pdf