Core Concepts
Transformer-based models, including powerful large language models (LLMs), are trained with a preset context length and generalize poorly from shorter training sequences to longer sequences at inference time. Researchers have proposed various methods to enhance the length extrapolation capability of Transformers, focusing on the design of positional encodings (PEs).
Abstract
This survey presents the advances toward length extrapolation of Transformers from the perspective of positional encoding (PE). It first introduces the two main families of PEs, absolute and relative, and how each can enable better length extrapolation. It then delves into extrapolation methods built on these PEs, covering position interpolation and randomized position techniques.
The key highlights and insights are:
Sinusoidal absolute PE, the first PE proposed for Transformers, has poor extrapolation performance. Researchers have explored ways to improve it, either by incorporating shift invariance or generating position embeddings that vary smoothly with position.
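For concreteness, the original sinusoidal absolute PE assigns each position a fixed vector of sines and cosines at geometrically spaced frequencies. A minimal NumPy sketch (the function name and shapes are illustrative, not from any particular library):

```python
import numpy as np

def sinusoidal_pe(seq_len: int, d_model: int) -> np.ndarray:
    """Sinusoidal absolute positional encoding from the original Transformer:
    PE[pos, 2i]   = sin(pos / 10000^(2i / d_model))
    PE[pos, 2i+1] = cos(pos / 10000^(2i / d_model))
    """
    positions = np.arange(seq_len)[:, None]        # (seq_len, 1)
    dims = np.arange(0, d_model, 2)[None, :]       # (1, d_model/2)
    angles = positions / np.power(10000.0, dims / d_model)
    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angles)                   # even dims: sine
    pe[:, 1::2] = np.cos(angles)                   # odd dims: cosine
    return pe
```

Because the encoding is a fixed function of absolute position, positions beyond those seen in training produce embedding patterns the model never learned to use, which is one intuition for its poor extrapolation.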
Relative PEs (RPEs) are believed to be more robust to changes in input length and are theoretically capable of running on unseen lengths, since they encode only the distance between tokens rather than absolute positions. Various novel RPEs have been proposed, with the goal of better modeling position information for enhanced extrapolation.
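One well-known RPE designed for extrapolation is ALiBi, which adds a head-specific linear penalty proportional to query-key distance directly to the attention scores, with no position embedding at all. A sketch of the bias computation (shapes and slope schedule follow the published method, but this is an illustrative reimplementation):

```python
import numpy as np

def alibi_bias(seq_len: int, num_heads: int) -> np.ndarray:
    """ALiBi-style relative position bias for causal attention.

    Each head h gets a slope m_h, and attention score (i, j) receives
    the penalty -m_h * (i - j); future positions are masked to -inf.
    Only relative distance matters, so no maximum length is baked in.
    """
    # Geometrically decreasing slopes, as in ALiBi for power-of-two head counts.
    slopes = 2.0 ** (-8.0 * np.arange(1, num_heads + 1) / num_heads)
    i = np.arange(seq_len)[:, None]   # query positions
    j = np.arange(seq_len)[None, :]   # key positions
    dist = i - j                      # >= 0 at or below the diagonal
    bias = -slopes[:, None, None] * dist[None, :, :]
    return np.where(dist[None] >= 0, bias, -np.inf)  # (num_heads, L, L)
```

The returned tensor is simply added to the pre-softmax attention logits; because the penalty grows smoothly with distance, it behaves the same way at lengths never seen in training.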
In the era of large language models (LLMs), position interpolation methods and randomized PEs have emerged as the frontiers of length extrapolation research. Position interpolation aims to scale position representations to match longer context windows, while randomized PEs expose the model to a much wider range of positions during training.
Despite the progress, there are still significant challenges in establishing a solid theoretical foundation for length extrapolation, designing comprehensive evaluation benchmarks, and exploring broader perspectives beyond positional encoding.