Multi-modal Large Language Models (MLLMs) exhibit substantial redundancy in how they process visual information, and this redundancy can be exploited to build more efficient training and inference methods with little impact on performance.
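One common way to exploit this redundancy is to prune a fraction of the visual tokens before they reach the language model. The sketch below is only illustrative, not the method of any specific paper: it assumes a PyTorch pipeline where the vision encoder emits a `(batch, num_tokens, hidden_dim)` tensor, and the `prune_visual_tokens` name and `keep_ratio` hyperparameter are hypothetical.

```python
import torch


def prune_visual_tokens(visual_tokens: torch.Tensor, keep_ratio: float = 0.5) -> torch.Tensor:
    """Randomly keep a fraction of visual tokens before they enter the LLM.

    visual_tokens: (batch, num_tokens, hidden_dim) embeddings from the vision encoder.
    keep_ratio: fraction of tokens to retain (illustrative hyperparameter).
    """
    batch, num_tokens, hidden_dim = visual_tokens.shape
    num_keep = max(1, int(num_tokens * keep_ratio))
    # Sample a random subset of token indices per example, then re-sort them so
    # the surviving tokens keep their original spatial order.
    idx = torch.rand(batch, num_tokens, device=visual_tokens.device).argsort(dim=1)[:, :num_keep]
    idx, _ = idx.sort(dim=1)
    return torch.gather(visual_tokens, 1, idx.unsqueeze(-1).expand(-1, -1, hidden_dim))


# Example: 576 ViT patch tokens per image reduced to 288 before the language model sees them.
tokens = torch.randn(2, 576, 4096)
pruned = prune_visual_tokens(tokens, keep_ratio=0.5)
print(pruned.shape)  # torch.Size([2, 288, 4096])
```

Smarter selection criteria (e.g., keeping tokens with high attention scores rather than a random subset) follow the same pattern; the point is that a large share of visual tokens can be dropped with little effect on downstream accuracy.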
LongLLaVA, a hybrid architecture that combines Mamba and Transformer blocks, scales multi-modal large language models to handle up to 1,000 images efficiently, achieving competitive performance on long-context understanding tasks while keeping computational cost low.
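To make the hybrid layout concrete, here is a minimal structural sketch, not LongLLaVA's actual implementation: the `MambaBlockStub` stands in for a real selective state-space block, and the layer count and `attn_every` ratio are illustrative assumptions rather than the model's configuration.

```python
import torch
import torch.nn as nn


class MambaBlockStub(nn.Module):
    """Stand-in for a real Mamba (selective state-space) block.

    A production model would use a selective SSM scan; a gated causal depthwise
    convolution keeps this sketch self-contained and runnable on CPU.
    """

    def __init__(self, dim: int):
        super().__init__()
        self.norm = nn.LayerNorm(dim)
        self.conv = nn.Conv1d(dim, dim, kernel_size=4, padding=3, groups=dim)
        self.gate = nn.Linear(dim, dim)
        self.proj = nn.Linear(dim, dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # (batch, seq, dim)
        h = self.norm(x)
        # Causal depthwise convolution as a cheap linear-time sequence mixer.
        mixed = self.conv(h.transpose(1, 2))[..., : x.size(1)].transpose(1, 2)
        return x + self.proj(mixed * torch.sigmoid(self.gate(h)))


class HybridBackbone(nn.Module):
    """Interleaves Mamba-style blocks with occasional Transformer blocks.

    attn_every=4 is an illustrative ratio, not the one used by LongLLaVA.
    """

    def __init__(self, dim: int = 512, n_layers: int = 8, n_heads: int = 8, attn_every: int = 4):
        super().__init__()
        self.layers = nn.ModuleList([
            nn.TransformerEncoderLayer(dim, n_heads, batch_first=True, norm_first=True)
            if (i + 1) % attn_every == 0
            else MambaBlockStub(dim)
            for i in range(n_layers)
        ])

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        for layer in self.layers:
            x = layer(x)
        return x


# Example: a long multi-image sequence (here, 4 images at 144 tokens each).
tokens = torch.randn(1, 4 * 144, 512)
print(HybridBackbone()(tokens).shape)  # torch.Size([1, 576, 512])
```

The design intuition is that the linear-time Mamba-style layers carry most of the long sequence of image tokens cheaply, while the sparser full-attention layers preserve the global token interactions needed for strong long-context understanding.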