Core concepts
The core contribution of this paper is a set of fast algorithms for solving regression problems against matrix exponential proxies and attention kernel proxies of the attention matrix, an important component of large language models and attention mechanisms.
Summary
The paper introduces two types of proxy matrices for the attention matrix, which is a crucial component in large language models:
Matrix Exponential Proxy:
The first proxy is the matrix exponential exp(A^T A), where A is the input matrix.
The authors consider two regression problems against this proxy:
min_x ||(A^T A)^j x - b||_2
min_x ||A(A^T A)^j x - b||_2
These regressions are essential because the matrix exponential admits the Taylor expansion exp(A^T A) = sum_{j>=0} (A^T A)^j / j!, so a regression against the full exponential can be approximated term by term via these smaller problems (see the sketch below).
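A minimal numpy illustration of the power-by-power reduction behind these regressions, under the assumption that A has full column rank: each step peels off one Gram factor with an exact solve, whereas the paper's algorithms replace every exact solve with a fast sketched one. The sizes n, d, j are illustrative.

```python
import numpy as np

# Toy instance: tall input matrix A with full column rank (assumption).
rng = np.random.default_rng(0)
n, d, j = 100, 10, 3
A = rng.standard_normal((n, d))
b = rng.standard_normal(d)

# Solve (A^T A)^j x = b by peeling off one Gram factor per step:
# (A^T A)^j x = b  <=>  (A^T A)^{j-1} x = (A^T A)^{-1} b, and so on.
x = b.copy()
for _ in range(j):
    x = np.linalg.solve(A.T @ A, x)

# Check: the residual of min_x ||(A^T A)^j x - b||_2 is ~0 here.
M = np.linalg.matrix_power(A.T @ A, j)
print(np.linalg.norm(M @ x - b))
```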
Attention Kernel Proxy:
The second proxy is the entrywise exponential of the Gram matrix AA^T, denoted as exp(AA^T).
The authors consider the regression problem min_x ||exp(AA^T) x - b||_2, which they call the attention kernel regression problem.
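For scale, a naive dense baseline for this problem can be sketched as follows; it materializes the n-by-n kernel, costing O(n^2 d) time and O(n^2) space, which is exactly the cost the paper's algorithms avoid. The data here is synthetic and illustrative; exp(AA^T) is positive semidefinite (the entrywise exponential of a PSD matrix) and nonsingular for generic data.

```python
import numpy as np

# Naive baseline for min_x ||exp(A A^T) x - b||_2, where exp(.) is
# applied entrywise. Forming K explicitly is the quadratic bottleneck.
rng = np.random.default_rng(1)
n, d = 200, 16
A = rng.standard_normal((n, d)) / np.sqrt(d)
b = rng.standard_normal(n)

K = np.exp(A @ A.T)          # entrywise exponential of the Gram matrix
x = np.linalg.solve(K, b)    # K is PSD and generically nonsingular
print(np.linalg.norm(K @ x - b))
```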
The paper designs fast algorithms for these regression problems, based on sketching and preconditioning techniques. The key ideas are:
For the matrix exponential regressions, the authors develop efficient solvers for the base cases, analyze how error propagates when those solvers are composed, and then extend the guarantees to higher powers by induction.
For the attention kernel regression, the authors leverage structured sketches to efficiently approximate the exponential kernel and then use preconditioning to speed up the regression.
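The sketch-and-precondition recipe that both ingredients feed into can be illustrated generically: sketch a tall design matrix, take a QR factorization of the small sketch, and use the R factor as a right preconditioner so that an iterative solver (LSQR here) converges in a number of steps that does not blow up with the condition number. This is a hedged sketch on a synthetic matrix M with a plain Gaussian sketch, not the paper's structured sketch or exact algorithm.

```python
import numpy as np
from scipy.sparse.linalg import LinearOperator, lsqr

rng = np.random.default_rng(2)
n, d = 4000, 50
# Ill-conditioned synthetic least-squares instance (columns scaled
# across four orders of magnitude).
M = rng.standard_normal((n, d)) * np.logspace(0, 4, d)
c = rng.standard_normal(n)

m = 4 * d                                     # sketch size ~ O(d) rows
S = rng.standard_normal((m, n)) / np.sqrt(m)  # Gaussian sketch (simplest choice)
_, R = np.linalg.qr(S @ M)                    # M @ inv(R) is well-conditioned

# Right-preconditioned operator v -> M R^{-1} v and its adjoint.
op = LinearOperator(
    (n, d),
    matvec=lambda v: M @ np.linalg.solve(R, v),
    rmatvec=lambda u: np.linalg.solve(R.T, M.T @ u),
)
y = lsqr(op, c, atol=1e-12, btol=1e-12)[0]
x = np.linalg.solve(R, y)                     # undo the preconditioner
print(np.linalg.norm(M.T @ (M @ x - c)))      # normal-equation residual ~ 0
```

In this pattern the expensive part, the sketch and the QR, is done once on a small matrix, and every subsequent iteration touches M only through matrix-vector products.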
The proposed algorithms run significantly faster than naive approaches that form the proxy matrices explicitly, making them suitable for large-scale applications involving attention mechanisms.
Stats
The condition number of the input matrix A is denoted as κ.
The failure probability is denoted as δ.
The accuracy parameter is denoted as ϵ.