Core Concepts
MMAR is a framework for joint image-text probabilistic modeling that avoids the information loss inherent in previous methods. It disentangles the diffusion process from the auto-regressive backbone and uses continuous image representations, and it is reported to improve both image understanding and image generation.
Statistics
MMAR-7B achieves an average score of 46.52 across 18 visual understanding benchmarks.
On the MSCOCO 30k dataset, MMAR-7B achieves an FID score of 17.1.
The allocation ratio of text-to-image and unconditional image generation tasks during training is set to 9:1.
The sample allocation ratio of image generation tasks and image understanding tasks is set to 1:1.
In the second stage of training, the image mask ratio range is adjusted to (0, 1].
MMAR-0.5B utilizes a Diffusion MLP with 8 residual blocks and 1024 channels.
MMAR-7B employs a Diffusion MLP with 12 residual blocks and 2048 channels.
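
The training ratios listed above can be read as a simple sampling scheme. The following is a minimal sketch under stated assumptions, not the authors' data pipeline: the function names `sample_task` and `sample_mask_ratio` are hypothetical, the 9:1 split is assumed to apply inside the generation half of the 1:1 split, and the mask ratio is assumed to be drawn uniformly from (0, 1], since the summary gives only the interval and not the distribution.

```python
import random

# Ratios taken from the statistics above; how they combine is an assumption.
GEN_TO_UND = (1, 1)      # image generation : image understanding
T2I_TO_UNCOND = (9, 1)   # text-to-image : unconditional image generation


def sample_task(rng):
    """Pick the task type for one training sample (hypothetical helper)."""
    family = rng.choices(["generation", "understanding"], weights=GEN_TO_UND)[0]
    if family == "understanding":
        return "understanding"
    # Assumption: the 9:1 ratio is applied within the generation samples only.
    return rng.choices(["text_to_image", "unconditional"], weights=T2I_TO_UNCOND)[0]


def sample_mask_ratio(rng):
    """Draw a mask ratio from the half-open interval (0, 1].

    Assumption: uniform sampling. rng.random() lies in [0, 1),
    so 1 - rng.random() lies in (0, 1].
    """
    return 1.0 - rng.random()


if __name__ == "__main__":
    rng = random.Random(0)
    tasks = [sample_task(rng) for _ in range(10_000)]
    for name in ("understanding", "text_to_image", "unconditional"):
        print(name, tasks.count(name) / len(tasks))
    print("example mask ratio:", sample_mask_ratio(rng))
```

Under these assumptions, the expected shares are about 50% understanding, 45% text-to-image, and 5% unconditional generation. A mask ratio of exactly 0 is never drawn, so every sampled image has at least some masked content.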