Core Concepts
Efficiently integrating visual and linguistic information in a model with linear computational complexity.
Abstract
This work presents Cobra, a multi-modal large language model (MLLM) with linear computational complexity. It explores how to fuse visual and linguistic information and shows performance and efficiency competitive with existing models. The study compares several modal fusion schemes and highlights Cobra's ability to handle visual illusions and to judge spatial relationships. Experiments demonstrate its effectiveness across multiple benchmarks.
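To make "linear computational complexity" concrete, the back-of-the-envelope sketch below compares how the cost of mixing information across a sequence grows with its length for self-attention versus a recurrent state space scan. It assumes a hidden width `d` and a per-channel state size `n`, drops constant factors and projection layers, and uses hypothetical function names; the numbers are illustrative, not measurements of Cobra.

```python
def attention_mixing_cost(seq_len: int, d: int) -> int:
    """Approximate multiply-adds for self-attention's token mixing.

    Every token scores every other token (QK^T) and then takes a weighted
    sum of values, so the cost grows with seq_len**2. Constants are dropped.
    """
    return seq_len * seq_len * d


def ssm_mixing_cost(seq_len: int, d: int, n: int) -> int:
    """Approximate multiply-adds for a recurrent state space scan.

    Each token updates a fixed-size state of d * n numbers once, so the cost
    grows linearly with seq_len. Constants are dropped.
    """
    return seq_len * d * n


if __name__ == "__main__":
    d, n = 2560, 16  # hypothetical widths, chosen only for illustration
    for seq_len in (512, 1024, 2048, 4096):
        ratio = attention_mixing_cost(seq_len, d) / ssm_mixing_cost(seq_len, d, n)
        print(f"L={seq_len:5d}  attention/SSM cost ratio ~ {ratio:.0f}")
```

Under this model, doubling the sequence length doubles the scan's cost but quadruples attention's, so the gap (here simply L / n) keeps widening as inputs get longer.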
Introduction
Large language models (LLMs) have transformed natural language understanding.
Shift towards general large-scale models like ChatGPT.
Related Work
Emergence of large language models like ChatGPT.
Trend towards investigating small-scale alternatives.
Cobra: Multimodal Large Language Model
Preliminaries on state space models (SSMs) and selective SSMs.
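The selective SSM idea can be sketched as a single-channel recurrence. A diagonal state matrix A is discretized with a step size Delta using the zero-order hold rule (A_bar = exp(Delta A), B_bar = (Delta A)^-1 (exp(Delta A) - I) Delta B), and in the selective variant Delta, B and C are computed from the current input instead of being fixed. This is a toy sketch, not Mamba's hardware-aware kernel: the scalar weights `w_b`, `w_c`, `w_delta` and `b_delta` are hypothetical stand-ins for learned projections, and a real layer processes many channels at once.

```python
import math
from typing import List, Sequence


def softplus(z: float) -> float:
    """Numerically stable softplus, used to keep the step size positive."""
    return max(z, 0.0) + math.log1p(math.exp(-abs(z)))


def selective_scan(
    xs: Sequence[float],
    a: Sequence[float],
    w_b: Sequence[float],
    w_c: Sequence[float],
    w_delta: float,
    b_delta: float,
) -> List[float]:
    """Toy single-channel selective SSM scan.

    xs      : input sequence x_1..x_L
    a       : diagonal of A (negative values keep the state stable), length N
    w_b, w_c: hypothetical per-state weights making B_t and C_t depend on x_t
    w_delta, b_delta: hypothetical weights making the step size depend on x_t
    """
    h = [0.0] * len(a)  # hidden state: fixed size N regardless of sequence length
    ys: List[float] = []
    for x in xs:
        delta = softplus(w_delta * x + b_delta)  # input-dependent step size Delta_t
        b_t = [wb * x for wb in w_b]             # input-dependent B_t
        c_t = [wc * x for wc in w_c]             # input-dependent C_t
        for i, a_i in enumerate(a):
            a_bar = math.exp(delta * a_i)         # ZOH: exp(Delta * A)
            b_bar = (a_bar - 1.0) / a_i * b_t[i]  # ZOH: (Delta A)^-1 (exp(Delta A) - 1) Delta B
            h[i] = a_bar * h[i] + b_bar * x       # h_t = A_bar h_{t-1} + B_bar x_t
        ys.append(sum(c * s for c, s in zip(c_t, h)))  # y_t = C_t h_t
    return ys


if __name__ == "__main__":
    xs = [0.5, -1.0, 2.0, 0.0, 1.5]
    ys = selective_scan(xs, a=[-1.0, -2.0, -3.0, -4.0],
                        w_b=[0.3, -0.2, 0.1, 0.4], w_c=[1.0, 0.5, -0.5, 0.2],
                        w_delta=0.8, b_delta=0.1)
    print([round(y, 4) for y in ys])
```

Because the state `h` has a fixed size and is touched once per token, the scan's cost grows linearly with sequence length; making Delta, B and C input-dependent is what lets the model decide, token by token, how much to keep or overwrite.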
Architecture consisting of a vision encoder, a projector, and a Mamba backbone.
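The data flow through this architecture can be sketched as follows: the vision encoder turns an image into a sequence of patch features, the projector maps each feature to the backbone's embedding width, and the resulting visual tokens are joined with the text embeddings into one sequence for the Mamba backbone. In this sketch the encoder output is faked with random vectors, and the two-layer GELU MLP projector, the dimensions, and the image-before-text ordering are assumptions for illustration; all helper names are hypothetical.

```python
import math
import random
from typing import List

Vector = List[float]
Matrix = List[List[float]]


def random_matrix(rows: int, cols: int, rng: random.Random) -> Matrix:
    """Small random weights standing in for learned parameters."""
    return [[rng.gauss(0.0, 0.02) for _ in range(cols)] for _ in range(rows)]


def matvec(w: Matrix, x: Vector) -> Vector:
    """Plain matrix-vector product."""
    return [sum(wij * xj for wij, xj in zip(row, x)) for row in w]


def gelu(z: float) -> float:
    """Exact GELU activation."""
    return 0.5 * z * (1.0 + math.erf(z / math.sqrt(2.0)))


class MLPProjector:
    """Hypothetical two-layer MLP mapping d_vision -> d_model."""

    def __init__(self, d_vision: int, d_model: int, rng: random.Random) -> None:
        self.w1 = random_matrix(d_model, d_vision, rng)
        self.w2 = random_matrix(d_model, d_model, rng)

    def __call__(self, feature: Vector) -> Vector:
        hidden = [gelu(z) for z in matvec(self.w1, feature)]
        return matvec(self.w2, hidden)


def build_backbone_input(patch_features: List[Vector],
                         text_embeddings: List[Vector],
                         projector: MLPProjector) -> List[Vector]:
    """Project visual features and place them before the text tokens (assumed order)."""
    visual_tokens = [projector(f) for f in patch_features]
    return visual_tokens + text_embeddings


if __name__ == "__main__":
    rng = random.Random(0)
    d_vision, d_model = 12, 8  # hypothetical sizes
    num_patches, num_text = 6, 4
    # Stand-in for vision encoder output: one feature vector per image patch.
    patch_features = [[rng.gauss(0.0, 1.0) for _ in range(d_vision)] for _ in range(num_patches)]
    # Stand-in for the backbone's text token embeddings.
    text_embeddings = [[rng.gauss(0.0, 1.0) for _ in range(d_model)] for _ in range(num_text)]
    projector = MLPProjector(d_vision, d_model, rng)
    sequence = build_backbone_input(patch_features, text_embeddings, projector)
    print(len(sequence), len(sequence[0]))  # 10 8
```

The projector is the only component that must bridge two embedding spaces; everything after it is an ordinary token sequence, so the backbone can process image and text tokens with the same linear-time scan.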
Training Recipe
Fine-tuning for two epochs on a combination of datasets.
Experiments
Evaluation on six benchmarks comparing Cobra's performance with that of other VLMs.
Inference speed comparison with TinyLLaVA and MobileVLM v2.
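One commonly cited reason a Mamba backbone can decode quickly is that it carries a fixed-size recurrent state from token to token, whereas a Transformer's key/value cache grows with every generated token. The arithmetic below illustrates that difference only; the layer counts and widths are hypothetical (not the configurations of Cobra, TinyLLaVA, or MobileVLM v2), the per-layer state layout (an SSM state of d_inner x d_state plus a short convolution buffer of d_inner x d_conv) is an assumption about a typical Mamba layer, and none of this is offered as the explanation of the measured speedups.

```python
def kv_cache_floats(num_layers: int, d_model: int, seq_len: int) -> int:
    """Values a Transformer caches during decoding: a key and a value
    vector of width d_model for every past token in every layer."""
    return 2 * num_layers * d_model * seq_len


def mamba_state_floats(num_layers: int, d_inner: int, d_state: int, d_conv: int) -> int:
    """Values a Mamba-style model carries between decoding steps (assumed layout):
    an SSM state (d_inner x d_state) and a convolution buffer (d_inner x d_conv)
    per layer. The sequence length does not appear at all."""
    return num_layers * d_inner * (d_state + d_conv)


if __name__ == "__main__":
    for seq_len in (256, 1024, 4096):
        kv = kv_cache_floats(num_layers=32, d_model=2560, seq_len=seq_len)
        ssm = mamba_state_floats(num_layers=64, d_inner=5120, d_state=16, d_conv=4)
        print(f"tokens={seq_len:5d}  KV cache={kv:>12,d}  Mamba state={ssm:>12,d}")
```

The KV cache column grows in proportion to the number of tokens, while the recurrent state stays constant, so each new decoding step reads the same amount of state no matter how long the conversation has become.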
Results
Cobra demonstrates competitive performance with fewer parameters than existing models.
Ablation Studies
Investigation into vision encoders, projectors, and pre-trained base Mamba models.
Case Studies
Examples in which Cobra judges spatial relationships and describes scenes more accurately than the compared models.
Limitations
Weaker text recognition compared to baseline models.
Sensitivity to numerical precision during inference.
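As a generic illustration (not an analysis of Cobra itself) of why recurrent computation can be sensitive to numerical precision, the sketch below runs the scalar recurrence h_t = a * h_(t-1) + x thousands of times, once in float64 and once in emulated IEEE 754 half precision (binary16, via the `struct` format character `'e'`). The values of `a`, `x` and the step count are arbitrary choices for the demonstration.

```python
import struct


def to_half(value: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 (float16) value."""
    return struct.unpack("<e", struct.pack("<e", value))[0]


def run_recurrence(a: float, x: float, steps: int, half: bool = False) -> float:
    """Iterate h_t = a * h_{t-1} + x starting from h_0 = 0.

    With half=True, the inputs and every intermediate result are rounded to
    float16, emulating a computation carried out entirely in half precision.
    """
    if half:
        a, x = to_half(a), to_half(x)
    h = 0.0
    for _ in range(steps):
        if half:
            h = to_half(to_half(a * h) + x)
        else:
            h = a * h + x
    return h


if __name__ == "__main__":
    a, x, steps = 0.999, 0.001, 2000
    exact = x * (1.0 - a ** steps) / (1.0 - a)  # closed form of the recurrence
    full = run_recurrence(a, x, steps)
    half = run_recurrence(a, x, steps, half=True)
    print(f"closed form {exact:.6f}  float64 {full:.6f}  float16 {half:.6f}")
```

In float16, even the coefficient 0.999 cannot be stored exactly, and every update is rounded again, so errors in a long-lived state compound over the sequence instead of averaging out.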
Conclusion
Summary of Cobra's contributions to improving efficiency in multi-modal language modeling.
Stats
"Cobra achieves extremely competitive performance with current computationally efficient state-of-the-art methods."
"Cobra performs 3× ∼4× faster than MobileVLM v2 3B."
Quotes
"We propose Cobra, a novel MLLM with linear computational complexity."
"Extensive experiments demonstrate that Cobra achieves extremely competitive performance."