Recurrent Transformers with Dynamic Halt: Investigating Depth-wise and Chunk-wise Approaches for Improved Adaptability and Generalization
This paper investigates two major approaches to augmenting Transformers with recurrence: depth-wise recurrence (Universal Transformers) and chunk-wise recurrence (Temporal Latent Bottleneck). The authors propose novel extensions to both models, including a global mean-based dynamic halting mechanism for the Universal Transformer and an augmentation of the Temporal Latent Bottleneck with elements of the Universal Transformer. The models are compared, and their inductive biases are probed on several diagnostic tasks, revealing the strengths and limitations of each approach.
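To make the depth-wise setting concrete, the sketch below illustrates one plausible reading of a global mean-based dynamic halting mechanism on top of Universal-Transformer-style recurrence: a single shared layer is applied repeatedly across depth, and a halting probability is computed from the mean-pooled hidden states (one score per sequence, rather than per-token as in ACT). The class name `GlobalMeanHaltingUT`, the `halt_proj` scorer, and the thresholded accumulation rule are illustrative assumptions, not the paper's exact formulation.

```python
import torch
import torch.nn as nn

class GlobalMeanHaltingUT(nn.Module):
    """Minimal sketch: depth-wise recurrence (shared layer reused at
    every step) with a global mean-based dynamic halting decision.
    Hypothetical reconstruction, not the authors' exact method."""

    def __init__(self, d_model=256, n_heads=4, max_steps=8, threshold=0.99):
        super().__init__()
        # One Transformer layer shared across all depth steps.
        self.shared_layer = nn.TransformerEncoderLayer(
            d_model, n_heads, batch_first=True)
        # Maps the mean-pooled state to a single halting score.
        self.halt_proj = nn.Linear(d_model, 1)
        self.max_steps = max_steps
        self.threshold = threshold

    def forward(self, x):
        # x: (batch, seq_len, d_model)
        cum_halt = torch.zeros(x.size(0), device=x.device)
        for _ in range(self.max_steps):
            x = self.shared_layer(x)
            # Global mean pooling -> one halting probability per
            # sequence, shared by all tokens (vs. per-token ACT).
            p_halt = torch.sigmoid(self.halt_proj(x.mean(dim=1))).squeeze(-1)
            # Accumulate halting mass; stop once every sequence in
            # the batch has crossed the (assumed) threshold.
            cum_halt = cum_halt + (1 - cum_halt) * p_halt
            if bool((cum_halt > self.threshold).all()):
                break
        return x

# Usage: process a batch, letting depth adapt to the input.
model = GlobalMeanHaltingUT()
out = model(torch.randn(2, 10, 256))  # (2, 10, 256)
```

A global (sequence-level) halting score is the simplest way to realize "global mean-based" halting; whether the paper additionally weights intermediate states by the halting distribution, as ACT does, would depend on details beyond this summary.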