Multi-modal Large Language Models (MLLMs) exhibit substantial redundancy in how they process visual information, and this redundancy can be exploited to build more efficient training and inference methods with little impact on performance.
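One common way to exploit this redundancy is to prune a fraction of the visual tokens before they reach the language model. The sketch below is only illustrative, not the method of any specific paper: it assumes a PyTorch pipeline where the vision encoder emits a `(batch, num_tokens, hidden_dim)` tensor, and the `prune_visual_tokens` name and `keep_ratio` hyperparameter are hypothetical.

```python
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a fraction of visual tokens before they enter the LLM.

    visual_tokens: (batch, num_tokens, hidden_dim) embeddings from the vision encoder.
    keep_ratio: fraction of tokens to retain (illustrative hyperparameter).
    """
    batch, num_tokens, hidden_dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample a random subset of token indices per example, then re-sort them so
    # the surviving tokens keep their original spatial order.
    idx = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)[:, :num_keep]
    idx, _ = idx.sort(dim=1)
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, hidden_dim))


# Example: 576 ViT patch tokens per image reduced to 288 before the language model sees them.
tokens = torch.randn(2, 576, 4096)
pruned = prune_visual_tokens(tokens, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([2, 288, 4096])
```

Smarter selection criteria (e.g., keeping tokens with high attention scores rather than a random subset) follow the same pattern; the point is that a large share of visual tokens can be dropped with little effect on downstream accuracy.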
LongLLaVA, a hybrid architecture that combines Mamba and Transformer blocks, scales multi-modal large language models to handle up to 1,000 images efficiently, achieving competitive performance on long-context understanding tasks while keeping computational cost low.
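To make the hybrid layout concrete, here is a minimal structural sketch, not LongLLaVA's actual implementation: the `MambaBlockStub` stands in for a real selective state-space block, and the layer count and `attn_every` ratio are illustrative assumptions rather than the model's configuration.

```python
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Stand-in for a real Mamba (selective state-space) block.

    A production model would use a selective SSM scan; a gated causal depthwise
    convolution keeps this sketch self-contained and runnable on CPU.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, seq, dim)
        h = self.norm(x)
        # Causal depthwise convolution as a cheap linear-time sequence mixer.
        mixed = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.proj(mixed * torch.sigmoid(self.gate(h)))


class HybridBackbone(nn.Module):
    """Interleaves Mamba-style blocks with occasional Transformer blocks.

    attn_every=4 is an illustrative ratio, not the one used by LongLLaVA.
    """

    def __init__(self, dim: int = 512, n_layers: int = 8, n_heads: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True, norm_first=True)
            if (i + 1) % attn_every == 0
            else MambaBlockStub(dim)
            for i in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


# Example: a long multi-image sequence (here, 4 images at 144 tokens each).
tokens = torch.randn(1, 4 * 144, 512)
print(HybridBackbone()(tokens).shape)  # torch.Size([1, 576, 512])
```

The design intuition is that the linear-time Mamba-style layers carry most of the long sequence of image tokens cheaply, while the sparser full-attention layers preserve the global token interactions needed for strong long-context understanding.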