Yang, K., Ackermann, J., He, Z., Feng, G., Zhang, B., Feng, Y., Ye, Q., He, D., & Wang, L. (2024). Do Efficient Transformers Really Save Computation? In Proceedings of the 41st International Conference on Machine Learning (ICML), PMLR 235. https://arxiv.org/pdf/2402.13934.pdf
This paper investigates whether "efficient" Transformer architectures, specifically the Sparse Transformer and the Linear Transformer, truly offer computational advantages over the standard Transformer on complex reasoning tasks modeled as dynamic programming (DP) problems. The authors analyze, both theoretically and empirically, the reasoning capabilities and computational cost of these efficient variants.
The authors theoretically analyze the expressiveness and complexity of Sparse and Linear Transformers by modeling reasoning tasks as DP problems, focusing on the required model size (hidden dimension) in relation to the problem scale (sequence length). They further conduct experiments on three representative DP tasks: Arithmetic expression evaluation, Longest Increasing Subsequence (LIS), and Edit Distance (ED), comparing the performance of efficient Transformers against standard Transformers across varying problem sizes and embedding dimensions.
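As a concrete illustration of the kind of DP computation these tasks involve, the snippet below implements the classic Edit Distance recurrence in Python. It is a minimal sketch of the underlying problem only, not the tokenized chain-of-thought format used in the paper's experiments, and the function name edit_distance is ours.

    def edit_distance(a: str, b: str) -> int:
        # dp[i][j] = minimum number of edits turning a[:i] into b[:j]
        m, n = len(a), len(b)
        dp = [[0] * (n + 1) for _ in range(m + 1)]
        for i in range(m + 1):
            dp[i][0] = i          # delete all i characters of a
        for j in range(n + 1):
            dp[0][j] = j          # insert all j characters of b
        for i in range(1, m + 1):
            for j in range(1, n + 1):
                cost = 0 if a[i - 1] == b[j - 1] else 1
                dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                               dp[i][j - 1] + 1,         # insertion
                               dp[i - 1][j - 1] + cost)  # substitution
        return dp[m][n]

    assert edit_distance("kitten", "sitting") == 3

Each DP cell depends on previously computed cells, which is exactly the sequential, step-by-step structure the paper uses to probe how much computation a Transformer must perform per generated token.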
The study shows that the presumed efficiency of Sparse and Linear Transformers does not always materialize, particularly for complex reasoning tasks with limited locality. While these architectures remain expressive enough to solve general DP tasks, their required hidden dimension grows with the problem size, so their asymptotic computational advantage over the standard Transformer diminishes and can disappear entirely.
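To see why the advantage can vanish, consider a back-of-envelope cost model: roughly L^2 * d operations per attention layer for standard self-attention versus L * d^2 for linear attention over a sequence of length L with hidden dimension d. If, as the paper's analysis suggests for general DP tasks, the efficient variant's hidden dimension must grow on the order of sqrt(L), the two costs scale identically. The sketch below plugs in illustrative numbers; the helper flops_per_layer, the fixed d = 64 for the standard model, and the constant 8 in the sqrt(L) scaling are our assumptions, not figures from the paper.

    def flops_per_layer(L, d_standard, d_linear):
        # Rough per-layer cost models (constants omitted):
        #   standard self-attention: every token attends to every token -> L^2 * d
        #   linear attention: kernelized update over a d x d state     -> L * d^2
        standard = L ** 2 * d_standard
        linear = L * d_linear ** 2
        return standard, linear

    for L in (256, 1024, 4096):
        d_std = 64                 # hidden size held fixed for the standard model (assumption)
        d_lin = int(L ** 0.5) * 8  # hidden size forced to grow roughly like sqrt(L) (assumption)
        std, lin = flops_per_layer(L, d_std, d_lin)
        print(f"L={L:5d}  standard~{std:.2e}  linear~{lin:.2e}")

Under these assumptions the two costs come out equal at every L, which mirrors the intuition behind the "advantage diminishes" finding; the paper's locality condition characterizes the regime in which the efficient variants do retain a genuine advantage.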
This research provides valuable insights into the practical limitations of popular efficient Transformer architectures, challenging the notion that they always offer computational benefits. The findings highlight the importance of considering problem characteristics, such as locality, when choosing appropriate Transformer models for complex reasoning tasks.
The study focuses on specific efficient Transformer designs and DP-based reasoning tasks. Further research could explore other efficient architectures and task domains to generalize the findings. Investigating the tightness of the complexity lower bound for Linear Transformers, particularly under the locality assumption, remains an open question.