EVE is an efficient vision-language foundation model that encodes both vision and language with a single shared Transformer network integrated with Modality-Aware MoE (mixture-of-experts) modules. The paper motivates EVE by the difficulty of building scalable vision-language models: instead of combining several objectives such as Image-Text Contrastive (ITC) learning and Image-Text Matching (ITM), EVE unifies pre-training into masked signal modeling, which accelerates training and improves downstream task performance.
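To make the architecture concrete, here is a minimal PyTorch sketch of one possible Modality-Aware MoE Transformer block. It is an illustration under assumptions, not EVE's actual code: all names (ModalityAwareMoEBlock, modality_ids, ffn_mult) are hypothetical, and it assumes the simplest reading of "modality-aware" routing, where self-attention is shared across the concatenated image and text tokens while each token's feed-forward pass goes through an expert selected by its modality.

```python
# Hypothetical sketch of a modality-aware MoE Transformer block; not the
# paper's implementation. Assumption: hard routing by modality, with one
# feed-forward expert per modality and shared self-attention.
import torch
import torch.nn as nn


class ModalityAwareMoEBlock(nn.Module):
    def __init__(self, dim: int = 768, num_heads: int = 12, ffn_mult: int = 4):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        # One feed-forward "expert" per modality (0 = vision, 1 = language).
        self.experts = nn.ModuleList([
            nn.Sequential(
                nn.Linear(dim, dim * ffn_mult),
                nn.GELU(),
                nn.Linear(dim * ffn_mult, dim),
            )
            for _ in range(2)
        ])

    def forward(self, x: torch.Tensor, modality_ids: torch.Tensor) -> torch.Tensor:
        # Shared self-attention over the joint image + text sequence.
        h = self.norm1(x)
        attn_out, _ = self.attn(h, h, h, need_weights=False)
        x = x + attn_out
        # Modality-aware FFN: each token goes to its modality's expert.
        h = self.norm2(x)
        out = torch.zeros_like(h)
        for expert_id, expert in enumerate(self.experts):
            mask = modality_ids == expert_id  # (batch, seq_len) boolean
            if mask.any():
                out[mask] = expert(h[mask])
        return x + out


# Usage: 16 image-patch tokens followed by 8 text tokens per sequence.
block = ModalityAwareMoEBlock()
tokens = torch.randn(2, 24, 768)
modality_ids = torch.tensor([[0] * 16 + [1] * 8] * 2)
print(block(tokens, modality_ids).shape)  # torch.Size([2, 24, 768])
```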
The study also compares pre-training tasks for vision-language models, underscoring the need for efficient and scalable approaches. Collapsing pre-training into a single masked-signal-modeling objective reduces complexity while improving performance; a schematic version of such a unified loss is sketched below. In extensive experiments, EVE achieves strong results on vision-language tasks such as visual question answering, visual reasoning, and image-text retrieval.
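The following sketch shows one way a single masked-signal-modeling objective can cover both modalities. It is an assumption-laden illustration, not the paper's training code: the function name masked_signal_loss and all arguments are hypothetical, and it assumes masked image positions are trained by regressing a target patch signal while masked text positions are trained with cross-entropy over the vocabulary, with the two terms summed into one loss.

```python
# Schematic unified masked-signal-modeling loss; hypothetical, not EVE's code.
# Assumptions: MSE regression on masked image positions, cross-entropy on
# masked text positions, summed into a single pre-training objective.
import torch
import torch.nn.functional as F


def masked_signal_loss(
    hidden: torch.Tensor,         # (B, L, D) shared-Transformer outputs
    modality_ids: torch.Tensor,   # (B, L) 0 = vision token, 1 = language token
    mask: torch.Tensor,           # (B, L) True where the input was masked out
    patch_targets: torch.Tensor,  # (B, L, P) target signal per image position
    token_targets: torch.Tensor,  # (B, L) target vocabulary id per text position
    to_patch: torch.nn.Linear,    # hidden -> patch-signal regression head
    to_vocab: torch.nn.Linear,    # hidden -> vocabulary prediction head
) -> torch.Tensor:
    vis = mask & (modality_ids == 0)
    txt = mask & (modality_ids == 1)
    loss = hidden.new_zeros(())
    if vis.any():  # reconstruct the masked image signal
        loss = loss + F.mse_loss(to_patch(hidden[vis]), patch_targets[vis])
    if txt.any():  # predict the masked text tokens
        loss = loss + F.cross_entropy(to_vocab(hidden[txt]), token_targets[txt])
    return loss


# Usage with random tensors; masks roughly 40% of all positions.
B, L, D, P, V = 2, 24, 768, 192, 30522
loss = masked_signal_loss(
    torch.randn(B, L, D),
    torch.tensor([[0] * 16 + [1] * 8] * B),
    torch.rand(B, L) < 0.4,
    torch.randn(B, L, P),
    torch.randint(V, (B, L)),
    torch.nn.Linear(D, P),
    torch.nn.Linear(D, V),
)
loss.backward()
```

Because both modalities share one prediction-style objective, no separate contrastive or matching pass over negative pairs is needed, which is the source of the training-speed advantage the paper claims over ITC- and ITM-based pipelines.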