The paper introduces DREAMLLM, a learning framework for building versatile Multimodal Large Language Models (MLLMs) that exploit the synergy between multimodal comprehension and creation. The framework rests on two fundamental principles: (1) generative modeling of both language and image posteriors by direct sampling in the raw multimodal space, and (2) fostering the generation of raw, interleaved documents. DREAMLLM can generate free-form interleaved content and shows superior performance as a zero-shot multimodal generalist.
Key insights distilled from the paper by Runpei Dong et al., arxiv.org, 03-19-2024
https://arxiv.org/pdf/2309.11499.pdf