The paper introduces DiJiang, a frequency-domain kernelization method that gives Transformer-based language models linear-complexity attention while preserving their performance. The key highlights and insights are:
The authors identify the main source of approximation error in existing linear attention schemes as the use of Monte Carlo sampling. To address this, they propose the use of weighted Quasi-Monte Carlo sampling, which offers superior approximation efficiency compared to traditional Monte Carlo methods.
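To make the Monte Carlo vs. quasi-Monte Carlo contrast concrete, here is a minimal sketch (an assumed illustration, not the paper's code or its weighted QMC scheme): it approximates a Gaussian kernel with random Fourier features whose frequencies are drawn either i.i.d. (Monte Carlo) or from a scrambled Halton low-discrepancy sequence (quasi-Monte Carlo).

```python
# Minimal sketch (illustration, not the paper's implementation): approximate the
# Gaussian kernel k(x, y) = exp(-||x - y||^2 / 2) with random Fourier features,
# drawing the frequencies either i.i.d. (Monte Carlo) or from a scrambled
# Halton sequence pushed through the Gaussian inverse CDF (quasi-Monte Carlo).
import numpy as np
from scipy.stats import norm, qmc

d, n_features = 8, 256
rng = np.random.default_rng(0)
x, y = rng.standard_normal(d), rng.standard_normal(d)
exact = np.exp(-np.sum((x - y) ** 2) / 2)

def feature_map(v, w):
    # Paired cos/sin features: z(x) . z(y) estimates exp(-||x - y||^2 / 2)
    # when the rows of w are (quasi-)samples from a standard normal.
    proj = w @ v
    return np.concatenate([np.cos(proj), np.sin(proj)]) / np.sqrt(w.shape[0])

w_mc = rng.standard_normal((n_features, d))                        # Monte Carlo
halton = qmc.Halton(d=d, scramble=True, seed=0).random(n_features)
w_qmc = norm.ppf(halton)                                           # quasi-Monte Carlo

for name, w in [("MC ", w_mc), ("QMC", w_qmc)]:
    approx = feature_map(x, w) @ feature_map(y, w)
    print(f"{name} estimate = {approx:.4f}, abs. error = {abs(approx - exact):.4f}")
```

With the same feature budget, the low-discrepancy frequencies typically track the target kernel value more closely, which is the efficiency gap the paper exploits.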
The authors leverage the Discrete Cosine Transform (DCT) to map the queries and keys of the Transformer's attention mechanism to the frequency domain. This mapping effectively eliminates the softmax operation, reducing the computational complexity of the attention mechanism from quadratic to linear in the sequence length.
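A hedged sketch of the computation pattern this enables follows; the feature map `phi()` below (a DCT followed by `exp` to keep scores non-negative) is an illustrative stand-in, not the paper's exact frequency-domain kernel. Once softmax(QK^T) is replaced by such a map, attention can be evaluated as phi(Q)(phi(K)^T V), so the cost grows linearly with sequence length.

```python
# Hedged sketch of the linear-attention pattern; phi() is an illustrative
# stand-in for the paper's frequency-domain feature map, not its formulation.
import numpy as np
from scipy.fft import dct

def phi(x):
    # Map each query/key row to the frequency domain and make it positive.
    return np.exp(dct(x, type=2, norm="ortho", axis=-1))

def linear_attention(Q, K, V):
    qf, kf = phi(Q), phi(K)            # (n, m) feature maps, m = head dim here
    kv = kf.T @ V                      # (m, d): aggregate keys and values once
    z = qf @ kf.sum(axis=0)            # (n,):   per-query normalizer
    return (qf @ kv) / (z[:, None] + 1e-6)

n, d = 512, 64
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((n, d)) for _ in range(3))
out = linear_attention(Q, K, V)
print(out.shape)  # (512, 64); cost is O(n * m * d) rather than O(n^2 * d)
```

The key design point is that `kf.T @ V` is computed once and shared by every query, which is what removes the quadratic pairwise score matrix.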
Theoretically, the authors demonstrate that the frequency-domain mapping is approximately equivalent to the original attention mechanism. They also show that weighted Quasi-Monte Carlo sampling provides tighter error bounds than the standard Positive Fixed Features (PFF) kernel.
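For background on why quasi-Monte Carlo can tighten such bounds, recall the classical integration-error rates (a general reminder under the usual bounded-variation assumptions, not the paper's specific theorem):

```latex
% Classical rates for approximating an integral over the unit cube with N
% sample points (general background, not the paper's specific bounds):
\begin{align}
  \text{Monte Carlo:} &\quad
    \mathbb{E}\!\left[\Big(\tfrac{1}{N}\textstyle\sum_{i=1}^{N} f(\omega_i) - \int f\Big)^{2}\right]^{1/2}
    = \mathcal{O}\!\big(N^{-1/2}\big), \\
  \text{Quasi-Monte Carlo:} &\quad
    \Big|\tfrac{1}{N}\textstyle\sum_{i=1}^{N} f(\omega_i) - \int f\Big|
    = \mathcal{O}\!\left(\tfrac{(\log N)^{d}}{N}\right).
\end{align}
```

In other words, for a fixed number of features N, a low-discrepancy sequence can approach the target kernel integral faster than i.i.d. sampling, which is the basis for the tighter bound.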
Experimental results across model sizes from 70M to 2.8B parameters show that DiJiang matches the performance of the original Transformer models while requiring as little as about 1/16 of the training cost and delivering up to 10x faster inference.
The authors also compare DiJiang with other linear attention methods, such as Linformer, Cosformer, and Performer, and show that their approach outperforms these methods in terms of both convergence speed and final model accuracy.
Visualizations of attention maps further show that DiJiang accurately approximates the original attention patterns, which is crucial for maintaining the performance of Transformer models.
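A quick numerical check in the same spirit (an assumed illustration reusing the stand-in `phi()` from the earlier sketch, not the paper's figures) is to row-normalize the kernelized scores and compare them entry-wise against the exact softmax map:

```python
# Illustrative check of how closely a kernelized map tracks softmax attention;
# phi() is the same stand-in feature map as in the earlier sketch.
import numpy as np
from scipy.fft import dct
from scipy.special import softmax

def phi(x):
    return np.exp(dct(x, type=2, norm="ortho", axis=-1))

n, d = 64, 32
rng = np.random.default_rng(1)
Q, K = rng.standard_normal((n, d)), rng.standard_normal((n, d))

exact = softmax(Q @ K.T / np.sqrt(d), axis=-1)          # (n, n) softmax attention map
scores = phi(Q / d**0.25) @ phi(K / d**0.25).T          # split 1/sqrt(d) across Q and K
approx = scores / scores.sum(axis=-1, keepdims=True)    # row-normalized like softmax

print("mean absolute difference per entry:", np.abs(exact - approx).mean())
```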
Overall, DiJiang represents a significant advance toward efficient and scalable Transformer models, promising wider applicability and facilitating progress across a range of natural language processing tasks.