Key concepts
In unified multimodal models, decoupling visual encoding into separate pathways for understanding and for generation significantly improves performance on both tasks, because each pathway can use an encoding method suited to its task rather than forcing one encoder to serve both.
Key statistics
Janus (1.3B parameters) achieved scores of 69.4, 63.7, and 87.0 on MMBench, SEED-Bench, and POPE, respectively, outperforming LLaVA-v1.5 (7B) and Qwen-VL-Chat (7B).
On the MSCOCO-30K visual generation benchmark, Janus achieved an FID of 8.53, surpassing text-to-image models such as DALL-E 2 and SDXL.
Janus outperforms the previous best unified model, Show-o, by 41% and 30% on the MME and GQA datasets, respectively.
On GenEval, Janus obtains 61% overall accuracy, outperforming Show-o (53%) and some popular generation-only methods, e.g., SDXL (55%) and DALL-E 2 (52%).
Quotations
"To the best of our knowledge, we are the first to highlight the importance of decoupling visual encoding within the unified multimodal understanding and generation framework."
"The simplicity, high flexibility, and effectiveness of Janus make it a strong candidate for next-generation unified multimodal models."