Adapting Mixture of Vision Experts to Enhance Multimodal Understanding in Large Language Models
MoVA is a multimodal large language model that adaptively routes and fuses task-specific vision experts through a coarse-to-fine mechanism, improving generalization across diverse image content.
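
To make the route-then-fuse idea concrete, here is a minimal sketch of one way a coarse-to-fine step could be arranged. It is an illustration under stated assumptions, not MoVA's actual implementation: the function names (`coarse_route`, `fine_fuse`), the top-k selection rule, the softmax gating, and all toy values are hypothetical and do not come from the source.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def coarse_route(relevance, k):
    """Coarse stage (assumed): keep the k experts with the highest relevance."""
    ranked = sorted(range(len(relevance)), key=lambda i: relevance[i], reverse=True)
    return sorted(ranked[:k])


def fine_fuse(features, gate_logits, selected):
    """Fine stage (assumed): softmax-weighted sum of the selected experts' features."""
    weights = softmax([gate_logits[i] for i in selected])
    dim = len(features[selected[0]])
    fused = [0.0] * dim
    for weight, idx in zip(weights, selected):
        for d in range(dim):
            fused[d] += weight * features[idx][d]
    return fused, weights


# Hypothetical toy setup: three vision experts with 2-dimensional features.
features = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
relevance = [0.9, 0.1, 0.7]    # assumed per-expert relevance to the input
gate_logits = [0.0, 0.0, 0.0]  # assumed fusion logits (uniform here)

selected = coarse_route(relevance, k=2)
fused, weights = fine_fuse(features, gate_logits, selected)
print(selected, weights, fused)
```

In this toy setup the coarse stage discards the least relevant expert, and the fine stage blends the remaining two with equal weight because the gate logits are uniform. In a real model both the relevance scores and the gate logits would be produced by learned components conditioned on the image and the instruction.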