Mamba models can be viewed as attention-driven models, shedding light on their inner workings and on how they compare to transformers.
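This attention view can be made concrete by unrolling the selective SSM recurrence: each output is a causal weighted sum of past inputs, and products of per-token transition terms act as implicit attention scores. A minimal NumPy sketch, assuming a diagonal state matrix; the arrays `A_bar`, `B_bar`, `C` stand for per-token discretized parameters and are illustrative, not any paper's exact notation:

```python
import numpy as np

def hidden_attention(A_bar, B_bar, C):
    """Unroll a diagonal selective SSM into an attention-like matrix.

    A_bar, B_bar, C: (L, N) per-token discretized parameters.
    Returns alpha of shape (L, L), where alpha[i, j] is the implicit
    weight of input token j on output token i (zero for j > i).
    """
    L, N = A_bar.shape
    alpha = np.zeros((L, L))
    for i in range(L):
        prod = np.ones(N)              # running product of A_bar over k = j+1..i
        for j in range(i, -1, -1):
            alpha[i, j] = C[i] @ (prod * B_bar[j])
            prod = prod * A_bar[j]     # extend the product one step back
    return alpha

# toy check: the matrix is lower-triangular, i.e. strictly causal
rng = np.random.default_rng(0)
alpha = hidden_attention(rng.uniform(0.8, 1.0, (6, 4)),
                         rng.normal(size=(6, 4)),
                         rng.normal(size=(6, 4)))
assert np.allclose(alpha, np.tril(alpha))
```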
MambaMixer, a new architecture with data-dependent weights, uses a dual selection mechanism across tokens and channels to efficiently and effectively mix informative tokens and channels while filtering out irrelevant ones.
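As a rough illustration of what a dual selection mechanism could look like, the sketch below gates a running state with input-dependent weights along the token axis and then, after a transpose, along the channel axis. All names (`selective_mix`, `dual_mix`) and the sigmoid gating are hypothetical simplifications, not MambaMixer's actual layers:

```python
import numpy as np

def selective_mix(x, W_gate, W_val):
    """One data-dependent mixing pass over the leading axis of x.

    A sigmoid gate computed from the input itself decides how much of the
    running state to keep (mix informative positions) versus overwrite
    (filter out irrelevant ones). x: (T, D) -> (T, D).
    """
    gate = 1.0 / (1.0 + np.exp(-(x @ W_gate)))   # input-dependent retain gate
    val = x @ W_val
    h = np.zeros(x.shape[1])
    out = np.empty_like(x)
    for t in range(x.shape[0]):
        h = gate[t] * h + (1.0 - gate[t]) * val[t]
        out[t] = h
    return out

def dual_mix(x, params_tok, params_ch):
    """Dual selection: mix across tokens, then across channels (via transpose)."""
    x = selective_mix(x, *params_tok)        # token mixing: (L, D)
    return selective_mix(x.T, *params_ch).T  # channel mixing: channels as the sequence

# toy usage; channel-mixing weights are (L, L), so the sequence length is fixed here
L, D = 16, 8
rng = np.random.default_rng(0)
x = rng.normal(size=(L, D))
y = dual_mix(x,
             (0.1 * rng.normal(size=(D, D)), 0.1 * rng.normal(size=(D, D))),
             (0.1 * rng.normal(size=(L, L)), 0.1 * rng.normal(size=(L, L))))
```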
This research paper introduces a mathematical framework for understanding how selective state space models (SSMs) compress memory, balancing information retention with computational efficiency for improved sequence modeling.
Selective state space models achieve efficient memory compression by dynamically filtering and updating hidden states without sacrificing performance, offering a more effective and scalable approach to processing long sequences.
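The compression mechanism described in the two summaries above can be sketched directly: a fixed-size hidden state absorbs an arbitrarily long sequence, and an input-dependent step size decides, token by token, whether to keep the existing memory or overwrite it. A toy NumPy version, assuming a diagonal state matrix and ZOH-style discretization; the helper names `make_delta`, `make_B`, `make_C` are illustrative:

```python
import numpy as np

def selective_scan(x, A, make_delta, make_B, make_C):
    """Compress a length-L sequence into a fixed-size state, one token at a time.

    x: (L, D) inputs; A: (N,) negative diagonal state matrix.
    make_delta/make_B/make_C map each input to selective parameters.
    Small delta -> A_bar near 1: the token is filtered out, memory is retained.
    Large delta -> A_bar near 0: memory is flushed and the token written in.
    """
    N = A.shape[0]
    h = np.zeros((N, x.shape[1]))                   # fixed-size memory, independent of L
    ys = []
    for x_t in x:
        delta = make_delta(x_t)                     # (N,) per-state step size, > 0
        A_bar = np.exp(delta * A)                   # ZOH discretization, diagonal A
        B_bar = (A_bar - 1.0) / A * make_B(x_t)     # matching discretized input map
        h = A_bar[:, None] * h + B_bar[:, None] * x_t[None, :]
        ys.append(make_C(x_t) @ h)                  # read out from the compressed state
    return np.stack(ys)

# toy usage: linear maps from the input define the selection (illustrative)
D, N = 8, 4
rng = np.random.default_rng(0)
Wd, Wb, Wc = 0.1 * rng.normal(size=(3, N, D))
y = selective_scan(rng.normal(size=(32, D)),
                   A=-np.linspace(0.5, 2.0, N),
                   make_delta=lambda u: np.log1p(np.exp(Wd @ u)),  # softplus > 0
                   make_B=lambda u: Wb @ u,
                   make_C=lambda u: Wc @ u)
```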
This paper investigates the dynamical properties of tokens in pre-trained Mamba models, showing that token dynamics are governed by the model parameters and affect model performance; these findings motivate refinements such as excluding convergent scenarios and reordering tokens by importance scores.
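One way such importance scores could be derived, purely as a hedged illustration, is to measure how much of each token's contribution survives decay in the final hidden state; tokens whose contributions have converged toward zero are candidates for exclusion. The norm-based score and the `reorder_by_importance` helper below are assumptions, not the paper's method:

```python
import numpy as np

def token_importance(A_bar, B_bar):
    """Score token j by how much of its write survives in the final state.

    A_bar, B_bar: (L, N) per-token discretized parameters (diagonal case).
    Token j contributes (prod_{k > j} A_bar[k]) * B_bar[j] to the last state;
    scores near zero mark "convergent" tokens whose influence has decayed away.
    """
    L, N = A_bar.shape
    scores = np.empty(L)
    tail = np.ones(N)                    # product of A_bar over k = j+1..L-1
    for j in range(L - 1, -1, -1):
        scores[j] = np.linalg.norm(tail * B_bar[j])
        tail = tail * A_bar[j]
    return scores

def reorder_by_importance(tokens, scores):
    """Put the most influential tokens first (illustrative refinement)."""
    order = np.argsort(-scores)
    return tokens[order], order
```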