Core Concepts
Jamba is a language model built on a novel hybrid architecture that combines Transformer and Mamba (state-space) layers with a mixture-of-experts (MoE) component, aiming for improved performance and efficiency compared to pure Transformer models.
Abstract
The Jamba model is based on a novel hybrid architecture that combines Transformer layers and Mamba (state-space) layers with a mixture-of-experts (MoE) component. This hybrid design aims to address the limitations of pure Transformer models, such as high memory and compute requirements when processing long contexts.
The key highlights of the Jamba model are:
Hybrid Transformer-Mamba Architecture:
Jamba interleaves blocks of Transformer and Mamba layers, leveraging the benefits of both model families.
The ratio of attention (Transformer) to Mamba layers can be adjusted to trade off memory usage, training efficiency, and long-context capability; a minimal sketch of the interleaving appears below.
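To make the interleaving concrete, here is a minimal PyTorch sketch of a Jamba-style hybrid stack. The class names, the `attn_every` period, and the stubbed Mamba block are illustrative assumptions rather than the paper's implementation; a real Mamba layer runs a selective state-space (selective-scan) kernel.

```python
import torch
import torch.nn as nn

class AttentionBlock(nn.Module):
    """Pre-norm self-attention block with a residual connection."""
    def __init__(self, d_model: int, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x):
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out

class MambaBlockStub(nn.Module):
    """Stand-in for a Mamba (selective SSM) layer. A real implementation
    would run a selective-scan kernel; this stub only preserves shapes."""
    def __init__(self, d_model: int):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        self.mix = nn.Linear(d_model, d_model)

    def forward(self, x):
        return x + self.mix(self.norm(x))

class HybridStack(nn.Module):
    """Interleaves one attention layer per `attn_every` layers; the rest
    are Mamba layers (attn_every=8 gives a 1:7 attention:Mamba ratio)."""
    def __init__(self, d_model: int, n_layers: int, attn_every: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(
            [AttentionBlock(d_model) if i % attn_every == attn_every - 1
             else MambaBlockStub(d_model)
             for i in range(n_layers)]
        )

    def forward(self, x):
        for layer in self.layers:
            x = layer(x)
        return x

x = torch.randn(2, 16, 512)                    # (batch, seq_len, d_model)
print(HybridStack(512, n_layers=16)(x).shape)  # torch.Size([2, 16, 512])
```

With `attn_every=8`, one layer in every eight is attention and the other seven are Mamba layers, matching the 1:7 attention-to-Mamba ratio reported in the ablations.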
Mixture-of-Experts (MoE):
MoE is added to some of the MLP layers, allowing for increased model capacity (total parameters) without a proportional increase in active parameters and compute requirements.
The MoE configuration (number of experts, number of top experts applied per token) can be tuned to trade off model capacity against active parameters and compute; see the routing sketch below.
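Below is a minimal top-K routing sketch for an MoE MLP, with assumed configuration values (the paper reports 16 experts with the top 2 applied per token). It is a readable reference implementation, not the batched, optimized routing used in practice.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEMLP(nn.Module):
    """MLP layer with top-K mixture-of-experts routing."""
    def __init__(self, d_model: int, d_hidden: int,
                 n_experts: int = 16, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_model, n_experts, bias=False)
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(d_model, d_hidden), nn.SiLU(),
                           nn.Linear(d_hidden, d_model))
             for _ in range(n_experts)]
        )

    def forward(self, x):                      # x: (n_tokens, d_model)
        scores = self.router(x)                # (n_tokens, n_experts)
        weights, idx = scores.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)   # renormalize over chosen experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):            # for each routing slot...
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e          # tokens sent to expert e in slot k
                if mask.any():
                    out[mask] += weights[mask, k:k + 1] * expert(x[mask])
        return out

tokens = torch.randn(10, 512)
print(MoEMLP(512, 2048)(tokens).shape)  # torch.Size([10, 512])
```

Only `top_k` expert MLPs run per token, which is what decouples total parameter count from per-token compute.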
Evaluation and Performance:
Jamba demonstrates comparable or better performance than state-of-the-art models like Llama-2 and Mixtral on a wide range of academic benchmarks.
On long-context evaluations, Jamba outperforms Mixtral on most datasets, while also providing much better throughput, especially for long contexts.
The 7B-based Jamba model (12B active parameters, 52B total available parameters) is designed to fit on a single 80GB GPU, even at context lengths of up to 256K tokens; the back-of-envelope below shows how MoE routing keeps the active count low.
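Why the active count is so much lower than the total: with top-2-of-16 routing, each token touches only 2/16 of the expert parameters. The shared-vs-expert split below is hypothetical, chosen only to reproduce the headline 52B/12B figures.

```python
# Back-of-envelope: top-2-of-16 routing keeps active parameters low.
n_experts, top_k = 16, 2
shared = 6e9    # assumed: attention + Mamba + embedding params (hypothetical split)
experts = 46e9  # assumed: parameters living in the MoE expert MLPs (hypothetical split)
total = shared + experts                       # ~52B available
active = shared + experts * top_k / n_experts  # ~12B used per token
print(f"total ~ {total / 1e9:.0f}B, active ~ {active / 1e9:.1f}B")
```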
Ablation Studies and Insights:
Experiments show the benefits of combining attention and Mamba layers, with a ratio of one attention layer to seven Mamba layers (1:7) found to be both effective and compute-efficient.
The hybrid Attention-Mamba architecture exhibits improved in-context learning capabilities compared to pure Mamba models.
MoE further improves the performance of the hybrid Attention-Mamba model at large scale.
Jamba does not require explicit positional information (e.g., positional embeddings), as the Mamba layers appear to provide it implicitly.
Overall, Jamba demonstrates the potential of hybrid architectures to achieve state-of-the-art performance while maintaining efficiency and flexibility in terms of memory usage and throughput, especially for long-context applications.
Stats
Jamba's 7B-based model has 12B active parameters and 52B total available parameters.
Jamba supports context lengths of up to 256K tokens, the longest among production-grade publicly available models at the time of its release.
Compared to recent open models, Jamba provides a substantial reduction in KV cache memory requirements: only 4GB for a 256K-token context, versus 128GB for Llama-2 7B and 32GB for Mixtral; the calculation sketched below shows where figures of this kind come from.
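Only the attention layers store a per-token key/value cache (Mamba layers carry constant-size recurrent state instead), and Jamba has few attention layers, which is what shrinks the cache. A rough calculation follows; the layer counts, KV-head counts, and head size below are assumptions chosen to be consistent with the reported figures, not values quoted from the paper.

```python
# Rough KV-cache sizing for a 256K-token context with 16-bit values.
def kv_cache_bytes(n_attn_layers, n_kv_heads, head_dim,
                   seq_len=256 * 1024, bytes_per_val=2):
    # factor 2: one K tensor and one V tensor per attention layer
    return 2 * n_attn_layers * seq_len * n_kv_heads * head_dim * bytes_per_val

GiB = 1024 ** 3
print(kv_cache_bytes(4, 8, 128) / GiB)    # Jamba-like (few attn layers, GQA): 4.0
print(kv_cache_bytes(32, 32, 128) / GiB)  # Llama-2-7B-like (full multi-head): 128.0
print(kv_cache_bytes(32, 8, 128) / GiB)   # Mixtral-like (grouped-query): 32.0
```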
Quotes
"Jamba interleaves blocks of Transformer and Mamba layers, enjoying the benefits of both model families."
"MoE is added in some of these layers to increase model capacity while keeping active parameter usage manageable."
"Remarkably, the model presents strong results for up to 256K tokens context length."