Key Concepts
This paper investigates the dynamical properties of tokens in pre-trained Mamba models. It shows that token dynamics are governed by model parameters and input data and directly affect model performance, motivating two refinements: excluding the convergence scenario and reordering tokens by importance score.
Abstract
Bibliographic Information:
Vo, T. N., Pham, T. D., Tong, X. T., & Nguyen, T. M. (2024). Demystifying the Token Dynamics of Deep Selective State Space Models. arXiv preprint arXiv:2410.03292.
Research Objective:
This paper aims to analyze the dynamical properties of tokens in pre-trained Mamba models, a type of deep selective state space model (SSM), and understand how these properties influence model performance.
Methodology:
The authors derive a dynamical system representing the continuous-time limit of the Mamba model. They analyze the asymptotic behavior of solutions to this system, focusing on the convergence or divergence of tokens and their corresponding hidden attention scores. This analysis is primarily conducted for the one-dimensional case, with empirical observations suggesting similar behavior in higher dimensions.
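To illustrate what a "continuous-time limit" means here, the sketch below shows a toy scalar recurrence converging to its limiting ODE as the step size shrinks. This is not the paper's actual S6 recurrence; the linear ODE dx/dt = a·x and its Euler discretization are illustrative assumptions.

```python
import math

def euler_limit_demo(a=-1.0, t_final=1.0):
    """The discrete recurrence x_{k+1} = x_k + h*a*x_k (toy stand-in for a
    discretized SSM update) approaches the ODE solution exp(a*t) as h -> 0."""
    exact = math.exp(a * t_final)
    errors = []
    for n in (10, 100, 1000):
        h, x = t_final / n, 1.0
        for _ in range(n):
            x += h * a * x  # one Euler step of dx/dt = a*x
        errors.append(abs(x - exact))
    return errors

errs = euler_limit_demo()
print(errs)  # error shrinks as the step size decreases
```

The shrinking error is what licenses analyzing the continuous-time system in place of the discrete model.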
Key Findings:
- The dynamics of tokens in Mamba models can be categorized into convergence and divergence scenarios, determined by model parameters and input data.
- In the convergence scenario, all tokens and hidden attention scores collapse to zero, potentially hindering model performance.
- In the divergence scenario, tokens diverge to infinity at varying rates, indicating unequal contributions to model updates during training.
- Reordering tokens based on their importance scores, derived from model parameters and initial token values, can improve model performance.
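The two scenarios above can be mimicked with a toy one-dimensional dynamical system. This is a deliberately simplified stand-in, assuming each token evolves as dx/dt = a·x with a per-token coefficient a; the paper's actual system couples tokens through the model parameters.

```python
import math

def simulate_token(x0, a, t):
    """Closed-form solution of the toy scalar ODE dx/dt = a * x."""
    return x0 * math.exp(a * t)

# Convergence scenario: negative coefficients -> all tokens collapse to zero,
# which is the regime the paper suggests excluding.
converged = [simulate_token(x0, a=-1.0, t=10.0) for x0 in (0.5, 1.0, 2.0)]

# Divergence scenario: positive per-token coefficients -> tokens blow up at
# different rates, so they contribute unequally to training updates.
diverged = [simulate_token(1.0, a, t=10.0) for a in (0.1, 0.5, 1.0)]

print(converged)  # all near zero
print(diverged)   # magnitudes grow at very different rates
```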
Main Conclusions:
- Understanding the dynamical properties of tokens in Mamba models provides valuable insights for improving their performance.
- Excluding the convergence scenario and implementing token reordering based on importance scores are two promising refinements for enhancing Mamba's effectiveness.
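The reordering refinement amounts to a sort by importance score. In the paper the scores are derived from model parameters and initial token values; the sketch below assumes they are already computed and simply applies the permutation.

```python
def reorder_tokens(tokens, scores):
    """Permute tokens by descending importance score (hypothetical scheme:
    scores are assumed precomputed from parameters and initial token values)."""
    order = sorted(range(len(tokens)), key=lambda i: scores[i], reverse=True)
    return [tokens[i] for i in order]

tokens = ["t0", "t1", "t2"]
scores = [0.2, 0.9, 0.5]
print(reorder_tokens(tokens, scores))  # ['t1', 't2', 't0']
```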
Significance:
This research contributes to a deeper theoretical understanding of deep selective state space models, particularly Mamba, and offers practical refinements for improving their performance in various applications.
Limitations and Future Research:
- The theoretical analysis primarily focuses on the one-dimensional case, with empirical observations suggesting similar behavior in higher dimensions. Further investigation is needed to rigorously extend these results to higher dimensions.
- The study focuses on the S6 layer, a core component of Mamba. Exploring the impact of other layers and token-wise operations on token dynamics is an area for future research.
Statistics
The model with only positive eigenvalues in the input-output matrix achieved a perplexity of 16.71 on the WikiText103 language modeling task.
The model with a mix of positive and negative eigenvalues achieved a perplexity of 16.84.
The model with only negative eigenvalues achieved a perplexity of 17.26.
The MambaVision model with token reordering achieved a top-1 accuracy of 82.02% on the ImageNet-1K image classification task.
The baseline MambaVision model achieved a top-1 accuracy of 81.90%.