Mixture-of-Depths: Dynamically Allocating Compute in Transformer-based Language Models to Improve Efficiency
Transformer-based language models typically spread compute uniformly across input sequences. We show they can instead learn to allocate compute dynamically to the positions that need it most, optimizing the allocation both along the sequence and across model depth. This yields significant compute savings without sacrificing performance.
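To make the idea concrete, the sketch below shows one way per-token compute routing could look in PyTorch: a learned scalar router scores each token, only the top-k tokens per sequence pass through the block's expensive computation, and the rest flow unchanged along the residual stream. This is a minimal illustration, not the paper's exact formulation; the class name, the capacity fraction, and the sigmoid weighting of the router scores are assumptions made for the example.

```python
import torch
import torch.nn as nn


class MixtureOfDepthsBlock(nn.Module):
    """Wraps a sub-block so only a top-k subset of tokens is processed.

    Hypothetical sketch: `block` stands in for any attention + MLP
    sub-block, and the capacity / weighting scheme is illustrative.
    """

    def __init__(self, d_model: int, block: nn.Module, capacity: float = 0.25):
        super().__init__()
        self.block = block                    # the expensive computation to gate
        self.router = nn.Linear(d_model, 1)   # learned scalar score per token
        self.capacity = capacity              # fraction of tokens to process

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, seq_len, d_model)
        batch, seq_len, d_model = x.shape
        k = max(1, int(seq_len * self.capacity))

        # Score every token; keep only the k highest-scoring per sequence.
        scores = self.router(x).squeeze(-1)           # (batch, seq_len)
        top_scores, top_idx = scores.topk(k, dim=-1)  # (batch, k)

        # Gather the chosen tokens and run them through the sub-block.
        idx = top_idx.unsqueeze(-1).expand(-1, -1, d_model)
        chosen = x.gather(1, idx)                     # (batch, k, d_model)
        processed = self.block(chosen)

        # Weight the output by the squashed router score so the router
        # receives gradients, then add it back at the original positions;
        # unchosen tokens skip the block entirely via the residual path.
        processed = processed * torch.sigmoid(top_scores).unsqueeze(-1)
        out = x.clone()
        out.scatter_add_(1, idx, processed)
        return out


# Toy usage: with capacity 0.25, only 4 of 16 tokens reach the MLP.
mlp = nn.Sequential(nn.Linear(64, 256), nn.GELU(), nn.Linear(256, 64))
layer = MixtureOfDepthsBlock(d_model=64, block=mlp, capacity=0.25)
out = layer(torch.randn(2, 16, 64))
```

Stacking such blocks lets different tokens pass through different numbers of layers, which is what allocating compute across model depth as well as along the sequence amounts to. Note that if `block` contains self-attention, the selected tokens must retain their original positions and causal order, a detail this simplified sketch elides.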