Key Concepts
This paper investigates two major approaches to augmenting Transformers with recurrence: depth-wise recurrence (Universal Transformers) and chunk-wise recurrence (Temporal Latent Bottleneck). The authors propose novel extensions to these models, including a global mean-based dynamic halting mechanism for the Universal Transformer and an augmentation of the Temporal Latent Bottleneck with elements from the Universal Transformer. The models are compared and their inductive biases probed on several diagnostic tasks, revealing the strengths and limitations of each approach.
Summary
The paper explores two main approaches to introducing recurrence into Transformers:
Depth-wise Recurrence:
The Universal Transformer (UT) applies the same Transformer block repeatedly, using a dynamic halting mechanism to adapt the amount of computation to input complexity.
The authors propose the Gated Universal Transformer (GUT), which adds a gating mechanism and a global mean-based dynamic halting approach to UT (sketched after this list).
Chunk-wise Recurrence:
The Temporal Latent Bottleneck (TLB) processes the input sequence in chunks, using a Transformer-based recurrent cell with an internal memory to capture temporal dependencies.
The authors propose the Gated Universal Temporal Latent Bottleneck (GUTLB), which combines the chunk-wise recurrence of TLB with the dynamic halting of GUT (both mechanisms are sketched below).
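The global mean-based halting is only summarized above, so here is a minimal PyTorch sketch of the idea, not the authors' implementation: a shared block is applied repeatedly, a per-position halting probability is computed at each step, and its mean over all positions is accumulated so the whole sequence halts as one unit. The module name `halt_proj`, the threshold, and the handling of already-halted sequences are all illustrative assumptions.

```python
import torch
import torch.nn as nn

class DepthwiseRecurrentEncoder(nn.Module):
    """UT-style depth-wise recurrence with a sketch of global
    mean-based dynamic halting (illustrative, not the paper's code)."""

    def __init__(self, d_model=256, nhead=4, max_steps=8, threshold=1.0):
        super().__init__()
        # One block shared across all depth steps (depth-wise recurrence).
        self.block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.halt_proj = nn.Linear(d_model, 1)  # per-position halting logit
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        cum_halt = x.new_zeros(x.size(0))  # one halting budget per sequence
        for _ in range(self.max_steps):
            x = self.block(x)
            # Per-position halting probability, then a *global mean* over
            # positions so the entire sequence halts together.
            p = torch.sigmoid(self.halt_proj(x)).squeeze(-1).mean(dim=1)
            cum_halt = cum_halt + p
            if bool((cum_halt >= self.threshold).all()):
                break  # every sequence in the batch has halted
        # A full ACT-style implementation would also stop updating sequences
        # that halt early and weight intermediate states by halting mass.
        return x
```

Averaging the per-position probabilities before accumulating is what makes the halting "global": all positions share one halting decision per sequence, rather than each position halting independently as in the original UT.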
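Chunk-wise recurrence can be sketched in the same spirit: the sequence is split into fixed-size chunks, each chunk attends within itself and reads from a small set of recurrent memory states, and the memory is then updated from the processed chunk. The chunk size, memory size, and single read/write attention layers below are illustrative assumptions; TLB's actual cell is more elaborate.

```python
import torch
import torch.nn as nn

class ChunkwiseRecurrentEncoder(nn.Module):
    """TLB-style chunk-wise recurrence sketch: process the input chunk
    by chunk, reading from and writing to a small recurrent memory."""

    def __init__(self, d_model=256, nhead=4, chunk_size=16, n_mem=8):
        super().__init__()
        self.chunk_size = chunk_size
        self.mem_init = nn.Parameter(torch.randn(n_mem, d_model))
        self.local_block = nn.TransformerEncoderLayer(d_model, nhead, batch_first=True)
        self.read_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)
        self.write_attn = nn.MultiheadAttention(d_model, nhead, batch_first=True)

    def forward(self, x):
        # x: (batch, seq_len, d_model); seq_len assumed divisible by chunk_size
        memory = self.mem_init.unsqueeze(0).expand(x.size(0), -1, -1)
        outputs = []
        for chunk in x.split(self.chunk_size, dim=1):
            h = self.local_block(chunk)                  # attention within the chunk only
            read, _ = self.read_attn(h, memory, memory)  # chunk reads the memory
            h = h + read
            upd, _ = self.write_attn(memory, h, h)       # memory attends to the chunk
            memory = memory + upd                        # recurrent state update
            outputs.append(h)
        return torch.cat(outputs, dim=1), memory
```

GUTLB, as described above, would additionally control the depth of the per-chunk computation with a GUT-style halting loop.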
The models are evaluated on several diagnostic tasks:
ListOps-O: GUT performs best in the near-IID setting, but chunk-wise recurrence models (TLB, GUTLB) show better out-of-distribution generalization.
Logical Inference: Broadly similar patterns are observed, though the chunk-wise recurrence models struggle more on longer sequences with complex logical structure.
Flip-flop Language Modeling: TLB and GUTLB demonstrate strong robustness to sequence length changes, outperforming the other models.
Long Range Arena (LRA) Text: TLB shows the strongest performance, while GUTLB performs worse due to its increased parameter sharing.
The authors discuss the trade-offs between depth-wise and chunk-wise recurrence: the former offers more flexibility but may be more susceptible to noise, while the latter restricts the attention window but may provide more robustness. They also outline several future research directions, including alternative attention mechanisms, recursive structures, linear RNNs, and large language models with chain-of-thought reasoning.
Statistics
The training data for ListOps-O has sequences of length up to 100, with a maximum of 5 arguments per list operation. (A ListOps sequence is a nested list operation such as [MAX 2 9 [MIN 4 7] 0], which evaluates to 9.)
The Logical Inference task trains on data with up to 6 logical operators and tests on data with 7-12 operators.
The Flip-flop Language Modeling task trains with sequence length 512 and a probability of 0.8 for the "ignore" instruction, and tests on sequences of length 512 and 1024 with varying "ignore" probabilities.
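To make these settings concrete: a flip-flop sequence interleaves write (w), read (r), and ignore (i) instructions, each followed by a bit, and the model must reproduce the most recently written bit after every read. Below is a minimal generator sketch; the three-instruction vocabulary matches the standard flip-flop task, but splitting the non-ignore probability evenly between writes and reads is an assumption, and `length` here counts instruction-bit pairs.

```python
import random

def flip_flop_sequence(length=512, p_ignore=0.8, seed=None):
    """Generate one flip-flop LM sequence of `length` instruction-bit
    pairs: write a bit, ignore a random bit, or read (the read's bit
    must equal the most recently written bit)."""
    rng = random.Random(seed)
    last_written = rng.choice("01")
    tokens = ["w", last_written]             # start with a write so reads are defined
    for _ in range(length - 1):
        r = rng.random()
        if r < p_ignore:                     # ignore: the bit carries no information
            tokens += ["i", rng.choice("01")]
        elif r < p_ignore + (1 - p_ignore) / 2:
            last_written = rng.choice("01")  # write: update the stored bit
            tokens += ["w", last_written]
        else:
            tokens += ["r", last_written]    # read: model must recall the stored bit
    return tokens

print(" ".join(flip_flop_sequence(length=8, p_ignore=0.8, seed=0)))
```

Varying the "ignore" probability at test time changes the typical distance between a write and the read that must recall it, which is what stresses the models' robustness to distribution shift.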