The paper presents a theoretical analysis of the attention mechanism through the lens of exchangeability and latent variable models. The key insights are:
Exchangeability of input tokens (e.g., words in a paragraph or patches in an image) induces a latent variable model, where the latent variable represents the "concept" or "meaning" of the input sequence.
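The link between exchangeability and a latent variable can be made concrete through de Finetti's theorem, which represents an (infinitely) exchangeable sequence as conditionally i.i.d. given a latent variable. In the notation of this summary (which may differ from the paper's):

```latex
P(x_1, \dots, x_n) \;=\; \int \prod_{i=1}^{n} P(x_i \mid z)\, P(z)\, dz
```

Here the tokens $x_1, \dots, x_n$ are independent draws from $P(\cdot \mid z)$ once the latent "concept" $z$ is fixed.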
The latent posterior distribution P(z|X), where z is the latent variable and X is the input sequence, is a minimal sufficient statistic for predicting the target variable y. This latent posterior serves as a desired representation of the input.
The attention mechanism can be derived as a specific parameterization of the inference of this latent posterior P(z|X). The authors prove that attention, at the desired parameter, approximates the latent posterior up to an error that decreases as the input sequence length grows.
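To make the "attention as posterior inference" reading concrete, the sketch below implements standard scaled dot-product attention and highlights that the softmax weights form a distribution over the input tokens, so the output is a weighted (posterior-mean-like) average of the values. This is a minimal illustration, not the paper's exact construction; all names and dimensions here are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # numerically stable softmax
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(q, K, V):
    """Scaled dot-product attention for a single query vector.

    The softmax weights sum to one and can be read as a
    distribution over the n input tokens; the output is the
    corresponding weighted mean of the value vectors.
    """
    d = K.shape[-1]
    scores = K @ q / np.sqrt(d)   # (n,) similarity of query to each key
    weights = softmax(scores)     # (n,) nonnegative, sums to 1
    return weights @ V, weights   # (d_v,) weighted mean of values

# Illustrative shapes: n tokens, key dim d, value dim d_v.
rng = np.random.default_rng(0)
n, d, d_v = 8, 4, 3
q = rng.normal(size=d)
K = rng.normal(size=(n, d))
V = rng.normal(size=(n, d_v))
out, w = attention(q, K, V)
```

Because the weights are a proper distribution, the output always lies in the convex hull of the value vectors, which is what makes the posterior-mean interpretation natural.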
The authors show that both supervised and self-supervised (e.g., masked token prediction) objectives allow empirical risk minimization to learn the desired parameter of the attention mechanism, with generalization errors that are independent of the input sequence length.
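As a rough illustration of the self-supervised objective, the sketch below computes an empirical risk for masked token prediction: cross-entropy of the model's per-position predictions, averaged over masked positions only. The function name and the toy inputs are assumptions for illustration, not the paper's formulation.

```python
import numpy as np

def masked_token_loss(logits, tokens, mask):
    """Empirical risk for masked-token prediction (a sketch).

    logits: (n, vocab) unnormalized predictions at each position
    tokens: (n,) true token ids
    mask:   (n,) boolean, True where the token was masked out
    Returns the mean cross-entropy over the masked positions.
    """
    # log-softmax over the vocabulary, computed stably
    logp = logits - logits.max(axis=-1, keepdims=True)
    logp = logp - np.log(np.exp(logp).sum(axis=-1, keepdims=True))
    nll = -logp[np.arange(len(tokens)), tokens]
    return nll[mask].mean()

# Toy example: uniform predictions over a 5-token vocabulary,
# so the loss at each masked position is exactly log(5).
logits = np.zeros((4, 5))
tokens = np.array([0, 1, 2, 3])
mask = np.array([True, False, True, False])
loss = masked_token_loss(logits, tokens, mask)
```

Note that the loss averages over masked positions rather than summing over the whole sequence, which matches the summary's point that the objective need not scale with sequence length.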
The theoretical analysis provides a complete characterization of the attention mechanism as a "greybox" approach, combining the handcrafted architecture induced by the latent variable model and the learnable parameter estimated from data.