Core Concepts
Mamba models can be viewed as attention-driven models, shedding light on their inner workings and enabling direct comparison with self-attention in transformers.
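To make the "hidden attention" view concrete, here is a minimal sketch (not the official Mamba code) that unrolls a one-channel selective SSM recurrence into an explicit causal matrix alpha, so that y = alpha @ x. The shapes and parameter names (A, B, C, L, N) are illustrative assumptions, not the paper's exact notation.

```python
import numpy as np

L, N = 6, 4                          # sequence length, state size (assumed)
rng = np.random.default_rng(0)
A = rng.uniform(0.5, 1.0, (L, N))    # per-step (input-dependent) diagonal transition
B = rng.normal(size=(L, N))          # per-step input projection
C = rng.normal(size=(L, N))          # per-step output projection
x = rng.normal(size=L)               # one scalar input channel

# 1) Run the selective SSM recurrence: h_t = A_t * h_{t-1} + B_t * x_t, y_t = <C_t, h_t>.
h = np.zeros(N)
y_rec = np.zeros(L)
for t in range(L):
    h = A[t] * h + B[t] * x[t]
    y_rec[t] = C[t] @ h

# 2) Unroll it: y_t = sum_{s<=t} alpha[t, s] * x_s, where
#    alpha[t, s] = <C_t, (prod_{k=s+1..t} A_k) * B_s> acts like a causal attention score.
alpha = np.zeros((L, L))
for t in range(L):
    for s in range(t + 1):
        decay = np.prod(A[s + 1:t + 1], axis=0)   # empty product -> ones when s == t
        alpha[t, s] = C[t] @ (decay * B[s])

assert np.allclose(y_rec, alpha @ x)  # same outputs, attention-matrix view
```

In this view every channel yields its own lower-triangular matrix alpha, which is the sense in which a Mamba layer produces many implicit attention matrices.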
Stats
Mamba models offer roughly 5x higher inference throughput than Transformers.
Mamba models generate roughly 100 times as many attention matrices as self-attention layers.
Quotes
"Selective SSMs are viewed as dual models, training in parallel on the entire sequence via IO-aware parallel scan."
"Mamba models can be viewed as attention-driven models, enabling comparison to self-attention layers in transformers."