Deeper transformer language models exhibit better compositional generalization than shallower models, even when controlling for total parameter count.
Interpretability techniques originally developed for transformer language models, such as contrastive activation addition, the tuned lens, and probing for latent knowledge, transfer effectively to state-of-the-art recurrent architectures like Mamba and RWKV, performing about as well on these models as they do on transformers.
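A minimal sketch of one of these techniques, contrastive activation addition, in PyTorch: a steering vector is the mean difference between hidden activations on two contrastive prompt sets, added back at inference time through a forward hook. The toy `Block` model, the choice of layer, and the scale `alpha` are illustrative assumptions; the same hook mechanics apply to any module exposing a (batch, seq, dim) hidden state, recurrent or attention-based.

```python
import torch
import torch.nn as nn

# Toy stand-in for one block of a language model's residual stream.
class Block(nn.Module):
    def __init__(self, d):
        super().__init__()
        self.lin = nn.Linear(d, d)

    def forward(self, x):                      # x: (batch, seq, dim)
        return x + self.lin(x)

def steering_vector(model, layer, pos_inputs, neg_inputs):
    """Mean last-token activation difference between contrastive prompt sets."""
    acts = {}

    def read(module, inputs, output):
        acts["h"] = output[:, -1, :].detach()

    handle = layer.register_forward_hook(read)
    with torch.no_grad():
        model(pos_inputs)
        pos = acts["h"].mean(0)
        model(neg_inputs)
        neg = acts["h"].mean(0)
    handle.remove()
    return pos - neg

def steer(layer, vec, alpha=4.0):
    """Add the scaled steering vector at every position during inference."""
    def write(module, inputs, output):
        return output + alpha * vec

    return layer.register_forward_hook(write)

d = 16
model = nn.Sequential(Block(d), Block(d))
vec = steering_vector(model, model[0], torch.randn(8, 5, d), torch.randn(8, 5, d))
handle = steer(model[0], vec)   # model(...) is now steered; handle.remove() undoes it
```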
The authors present two new sequence model architectures, Eagle (RWKV-5) and Finch (RWKV-6), which improve on RWKV-4 by incorporating multi-headed matrix-valued states and dynamic recurrence mechanisms. These changes increase the models' expressivity while retaining the efficient inference and training characteristics of RNNs.
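A minimal sketch of the core state update for a single head, assuming a simplified form of the papers' recurrence (no bonus term, gating, or multi-head layout): the state is a d-by-d matrix that decays each step and receives a rank-1 write. With a fixed learned decay `w` this resembles Eagle; making `w` depend on the current input is roughly what Finch's dynamic recurrence adds. Names follow RWKV convention (receptance `r`, key `k`, value `v`).

```python
import torch

def matrix_state_scan(r, k, v, w):
    """
    r, k, v: (T, d) receptance / key / value per timestep.
    w:       (T, d) per-timestep decay in (0, 1); data-dependent in Finch.
    Returns per-timestep outputs of shape (T, d).
    """
    T, d = r.shape
    S = torch.zeros(d, d)                      # matrix-valued state, one head
    ys = []
    for t in range(T):
        # Decay the state row-wise, then write the rank-1 update k_t v_t^T.
        S = w[t].unsqueeze(1) * S + torch.outer(k[t], v[t])
        ys.append(r[t] @ S)                    # read the state with the receptance
    return torch.stack(ys)

T, d = 12, 8
r, k, v = torch.randn(3, T, d).unbind(0)
w = torch.sigmoid(torch.randn(T, d))           # stand-in for a learned, dynamic decay
y = matrix_state_scan(r, k, v, w)              # (T, d)
```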
Cross-Architecture Transfer Learning (XATL) can significantly reduce training time and improve the performance of Linear-Cost Inference (LCI) transformer variants by directly transferring compatible weights from pre-trained Transformer checkpoints, rather than training the LCI models from scratch.
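A hedged sketch of the transfer step, assuming PyTorch state dicts on both sides: copy every parameter whose name and shape match and that belongs to a component the two architectures share, and leave the new token mixer randomly initialized. The `SHARED_MARKERS` filter below is an illustrative assumption, not the paper's exact component list.

```python
import torch

# Illustrative name filter; the real shared-component set depends on how
# the two architectures are implemented (embeddings, norms, MLPs, head).
SHARED_MARKERS = ("embed", "norm", "mlp", "lm_head")

def transfer_compatible(transformer_sd, lci_model):
    """Copy matching shared weights from a transformer checkpoint into an LCI model."""
    lci_sd = lci_model.state_dict()
    copied = []
    for name, weight in transformer_sd.items():
        if (name in lci_sd
                and lci_sd[name].shape == weight.shape
                and any(m in name for m in SHARED_MARKERS)):
            lci_sd[name] = weight.clone()
            copied.append(name)
    lci_model.load_state_dict(lci_sd)
    return copied   # everything else (the token mixer) trains from scratch
```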
Transformer-based language models can learn to dynamically allocate compute, deciding both which positions in a sequence and which layers of the model receive full computation. This yields significant compute savings without sacrificing performance.
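The mechanism can be sketched as a wrapper around any block, assuming the top-k routing scheme this line of work describes: a lightweight per-block router scores every token, only the top-scoring fraction passes through the expensive block, and the rest ride the residual stream unchanged. The 25% capacity and the sigmoid gate are illustrative choices.

```python
import torch
import torch.nn as nn

class RoutedBlock(nn.Module):
    """Wrap any (B, T, d) -> (B, T, d) block with top-k token routing."""

    def __init__(self, block, d_model, capacity=0.25):
        super().__init__()
        self.block = block
        self.router = nn.Linear(d_model, 1)
        self.capacity = capacity               # fraction of tokens processed

    def forward(self, x):                      # x: (B, T, d)
        B, T, d = x.shape
        k = max(1, int(self.capacity * T))
        scores = self.router(x).squeeze(-1)    # (B, T) per-token router scores
        top = scores.topk(k, dim=1).indices    # (B, k) tokens that get compute
        idx = top.unsqueeze(-1).expand(-1, -1, d)
        chosen = x.gather(1, idx)              # (B, k, d) routed tokens
        # Weight the block output by the router score so routing is trainable.
        gate = torch.sigmoid(scores.gather(1, top)).unsqueeze(-1)
        updated = chosen + gate * self.block(chosen)
        return x.scatter(1, idx, updated)      # unrouted tokens pass through unchanged

d = 32
layer = RoutedBlock(nn.TransformerEncoderLayer(d, nhead=4, batch_first=True), d)
y = layer(torch.randn(2, 10, d))               # only ~25% of tokens hit the block
```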
Emergent modularity arises spontaneously in pre-trained language models, and making it explicit during fine-tuning, by partitioning feed-forward layers into an Emergent Mixture-of-Experts (EMoE), improves both in-domain and out-of-domain downstream generalization.
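A minimal sketch of the recipe, under the assumption that EMoE's core moves are (1) clustering a pretrained FFN's hidden neurons into experts and (2) gating each token toward the clusters whose centers it most resembles, adding no new parameters; the k-means details, expert count, and top-k below are illustrative.

```python
import torch

def split_into_experts(w_in, n_experts=4, iters=10):
    """Cluster hidden neurons (rows of w_in, shape (hidden, d)) with plain k-means."""
    centers = w_in[torch.randperm(len(w_in))[:n_experts]].clone()
    for _ in range(iters):
        assign = torch.cdist(w_in, centers).argmin(dim=1)      # (hidden,)
        for e in range(n_experts):
            if (assign == e).any():
                centers[e] = w_in[assign == e].mean(0)
    return assign, centers

def emoe_forward(x, w_in, w_out, assign, centers, top_k=2):
    """x: (B, d). Activate only the top_k expert clusters per token."""
    gate = x @ centers.t()                     # (B, E) parameter-free gate
    active = gate.topk(top_k, dim=1).indices
    mask = torch.zeros_like(gate).scatter(1, active, 1.0)
    h = torch.relu(x @ w_in.t())               # (B, hidden) dense up-projection
    h = h * mask[:, assign]                    # silence neurons of inactive experts
    return h @ w_out.t()                       # (B, d) down-projection

d, hidden = 16, 64
w_in, w_out = torch.randn(hidden, d), torch.randn(d, hidden)
assign, centers = split_into_experts(w_in)
y = emoe_forward(torch.randn(4, d), w_in, w_out, assign, centers)
```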