Regarding the training bias of transformer models, it has been theoretically proven that representing highly sensitive functions requires very sharp minima. This result agrees with empirical investigations and accounts for a diverse range of empirical findings that past theoretical work had left unexplained. To understand the capabilities of transformers, a shift is proposed from conventional theory toward studying quantitative bounds and the shape of the loss landscape.
Empirical studies have identified a range of learnability biases and limitations of transformers, such as a persistent difficulty in learning to compute simple formal languages like PARITY, and a bias towards low-degree functions. Theoretical understanding remains limited, with existing expressiveness theory either overpredicting or underpredicting realistic learning abilities. The loss landscape under the transformer architecture is constrained by input-space sensitivity: transformers whose output is sensitive to many parts of the input string inhabit isolated points in parameter space. This results in a low-sensitivity bias in generalization.
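To make the notion of input-space sensitivity concrete, the following sketch (an illustration of the standard average-sensitivity definition, not code from the paper; the function names are our own) computes, for a Boolean function, the expected number of single-bit flips that change its output. PARITY attains the maximal value n, whereas a degree-1 function such as a single input coordinate has average sensitivity 1.

```python
import itertools

def parity(bits):
    """PARITY: 1 if the number of 1s is odd, else 0. Flipping any bit flips the output."""
    return sum(bits) % 2

def first_bit(bits):
    """A degree-1 function: depends only on the first coordinate."""
    return bits[0]

def average_sensitivity(f, n):
    """Average, over all 2^n inputs, of the number of bit flips that change f's output."""
    total = 0
    for bits in itertools.product((0, 1), repeat=n):
        y = f(bits)
        for i in range(n):
            flipped = bits[:i] + (1 - bits[i],) + bits[i + 1:]
            if f(flipped) != y:
                total += 1
    return total / 2 ** n

if __name__ == "__main__":
    n = 8
    print(average_sensitivity(parity, n))     # 8.0: maximal, every flip changes the output
    print(average_sensitivity(first_bit, n))  # 1.0: only flipping bit 0 matters
```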
Given dramatic advances in machine learning applications powered by transformer models, there has been substantial interest in understanding which functions are easier or harder to learn and represent using transformers. Empirical research on both formal languages and synthetic functions has uncovered an intriguing array of learning biases, but theoretical understanding is lacking.
While substantial theoretical work has considered both the learnability and the expressiveness of transformers, existing theoretical studies do not consistently explain such learning biases. We show that transformers fitting high-sensitivity functions must inhabit very steep minima in parameter space, explaining both the difficulty of training transformers on PARITY and their poor length generalization on it.
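As a rough operational illustration of what a "steep minimum" means (a hypothetical probe sketched here, not the paper's proof technique), one can estimate local sharpness by measuring how much a trained model's loss increases under small random parameter perturbations; the name sharpness_probe, the loss_fn argument, and the input/target layout below are illustrative assumptions.

```python
import copy
import torch

@torch.no_grad()
def sharpness_probe(model, loss_fn, inputs, targets, radius=1e-2, trials=10):
    """Heuristic sharpness estimate: average loss increase when every parameter
    is perturbed by Gaussian noise of a given radius (illustrative, not a bound)."""
    base_loss = loss_fn(model(inputs), targets).item()
    increases = []
    for _ in range(trials):
        perturbed = copy.deepcopy(model)  # perturb a copy so the original stays intact
        for p in perturbed.parameters():
            p.add_(radius * torch.randn_like(p))
        increases.append(loss_fn(perturbed(inputs), targets).item() - base_loss)
    return base_loss, sum(increases) / trials
```

A model sitting in a sharp minimum would show large average loss increases even at small radii, which is the sense in which fitting a high-sensitivity function like PARITY is argued to require brittle parameter settings.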
Some prior work has studied the learnability of problems for transformers. For example, Edelman et al. bound the statistical capacity of the transformer architecture, showing that transformers can generalize with good sample efficiency on the functions they preferentially represent.
Notably, sparse parities could indeed be learned well by transformers. However, this result does not prove that PARITY or other highly sensitive functions are hard to learn.
Other work has studied simplified setups such as linear attention or individual attention layers. Here, we provide results that have direct bearing on the learnability of PARITY and other sensitive functions, characterizing the loss landscape of transformers in terms of input-space sensitivity.
Our results also show that this low-sensitivity bias can be overcome by scaling the number of computation steps with the input length.
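As a concrete illustration of what scaling the number of computation steps can buy (a toy sketch in the spirit of scratchpad-style supervision, not the paper's construction), one can replace the single highly sensitive target bit with a sequence of running parities: each step then depends only on one new input bit and the previous intermediate result, so the per-step target has low sensitivity while the final symbol still equals PARITY.

```python
def parity_one_shot(bits):
    """Target for fixed-depth prediction: one highly sensitive output per string."""
    return sum(bits) % 2

def parity_scratchpad(bits):
    """Target sequence for a scratchpad-style setup: the running parity after each bit.

    Each step depends only on the previous scratchpad symbol and one input bit,
    so the per-step function has low sensitivity even though the final answer is PARITY.
    """
    acc, steps = 0, []
    for b in bits:
        acc ^= b
        steps.append(acc)
    return steps  # the last element equals parity_one_shot(bits)

print(parity_scratchpad([1, 0, 1, 1]))  # [1, 1, 0, 1] -> final answer 1
```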