The paper proposes MM-Interleaved, an end-to-end generative model for processing interleaved image-text data. The key contributions are:
Multi-Modal Feature Synchronizer (MMFS): MMFS reduces the number of visual tokens that the multi-modal language model must take as input, while still enabling efficient extraction of fine-grained visual details from multi-scale and multi-image feature maps (see the sketch after this list).
MM-Interleaved Architecture: The proposed architecture integrates a visual foundation model, a large language model, and a diffusion model. It leverages MMFS to allow the language model to dynamically access detailed image features during generation, overcoming the limitations of fixed visual token inputs.
Comprehensive Evaluation: MM-Interleaved is pre-trained on a mixture of image-text pairs and interleaved image-text sequences, and further fine-tuned on various downstream tasks. It achieves state-of-the-art results on a wide range of multi-modal comprehension and generation benchmarks without using any in-house data.
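To make the role of the synchronizer concrete, here is a minimal, illustrative sketch in the spirit of MMFS: the language model's hidden states cross-attend to multi-scale image feature maps, so fine-grained detail can be fetched on demand instead of being packed into a long fixed sequence of visual tokens. The class name, the use of plain (non-deformable) cross-attention, and the shapes are assumptions made for illustration, not the paper's exact design.

```python
# Simplified feature-synchronizer sketch (assumed names and plain cross-attention).
import torch
import torch.nn as nn


class FeatureSynchronizerSketch(nn.Module):
    def __init__(self, llm_dim: int, vis_dim: int, num_heads: int = 8):
        super().__init__()
        self.vis_proj = nn.Linear(vis_dim, llm_dim)  # project visual features into the LLM space
        self.cross_attn = nn.MultiheadAttention(llm_dim, num_heads, batch_first=True)
        self.norm = nn.LayerNorm(llm_dim)

    def forward(self, llm_hidden: torch.Tensor, multi_scale_feats: list[torch.Tensor]) -> torch.Tensor:
        # llm_hidden: (B, T, llm_dim) hidden states of the language model
        # multi_scale_feats: list of (B, N_i, vis_dim) feature maps, one entry per scale or image
        vis = torch.cat([self.vis_proj(f) for f in multi_scale_feats], dim=1)  # (B, sum N_i, llm_dim)
        # Queries come from the LLM; keys/values come from the image features,
        # so each token pulls in only the visual detail it needs.
        attended, _ = self.cross_attn(query=llm_hidden, key=vis, value=vis)
        return self.norm(llm_hidden + attended)  # residual update of the LLM states


if __name__ == "__main__":
    syncer = FeatureSynchronizerSketch(llm_dim=512, vis_dim=256)
    hidden = torch.randn(2, 16, 512)                              # 2 sequences, 16 LLM tokens
    feats = [torch.randn(2, 64, 256), torch.randn(2, 256, 256)]   # two feature-map scales
    print(syncer(hidden, feats).shape)  # torch.Size([2, 16, 512])
```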
The paper first formulates the task of generative modeling over interleaved image-text data, then details the MM-Interleaved architecture, covering the MMFS module, the multi-modal language model, and the diffusion-based image decoder. The model is pre-trained on a diverse mixture of image-text datasets and fine-tuned on downstream tasks, and extensive experiments show that it outperforms previous methods on multi-modal comprehension, text generation, and image generation.
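For intuition about what training on interleaved data involves, the following is a hedged sketch of a joint objective of the kind described above: autoregressive next-token cross-entropy for text plus a diffusion-style noise-prediction loss for images, both conditioned on the preceding multi-modal context. The `denoiser` interface, the simplified forward-diffusion step, and the loss weighting are assumptions for the sake of a short example, not the paper's exact formulation.

```python
# Illustrative joint objective for interleaved image-text modeling (assumed interfaces).
import torch
import torch.nn.functional as F


def interleaved_loss(text_logits, text_targets, denoiser, image_latents, context, lambda_img: float = 1.0):
    # Text branch: standard autoregressive cross-entropy; -100 masks image slots and padding.
    loss_text = F.cross_entropy(
        text_logits.reshape(-1, text_logits.size(-1)),
        text_targets.reshape(-1),
        ignore_index=-100,
    )

    # Image branch: the diffusion decoder predicts the noise added to image latents,
    # conditioned on context features produced by the language model (and synchronizer).
    noise = torch.randn_like(image_latents)
    timesteps = torch.randint(0, 1000, (image_latents.size(0),), device=image_latents.device)
    noisy_latents = image_latents + noise  # stand-in for a real noise schedule
    loss_img = F.mse_loss(denoiser(noisy_latents, timesteps, context), noise)

    return loss_text + lambda_img * loss_img


if __name__ == "__main__":
    B, T, V, C = 2, 8, 100, 16
    logits = torch.randn(B, T, V)
    targets = torch.randint(0, V, (B, T))
    latents = torch.randn(B, C)
    context = torch.randn(B, T, 32)
    dummy_denoiser = lambda x, t, ctx: x * 0.0  # toy denoiser just for this demo
    print(interleaved_loss(logits, targets, dummy_denoiser, latents, context).item())
```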