The paper explores the opportunities for dynamic layer sparsity within large language models (LLMs) and vision transformers (ViTs). It profiles the residual blocks in these models to quantify the contribution of each layer to the final output, finding that as model size increases, the median contribution of individual layers decreases significantly, often to around 1% or less.
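To make the profiling step concrete, here is a minimal sketch of how per-layer contribution can be measured. It assumes a Hugging Face GPT-2-style model whose residual blocks live in `model.transformer.h`; the relative-norm metric used below is a common proxy and may differ from the paper's exact formulation.

```python
import torch

@torch.no_grad()
def layer_contributions(model, input_ids):
    """Measure each residual block's contribution as the relative norm
    of the update it adds to the hidden state."""
    contributions = []

    def hook(module, inputs, output):
        x_in = inputs[0]                      # hidden state entering the block
        x_out = output[0] if isinstance(output, tuple) else output
        delta = x_out - x_in                  # the block's residual update
        contributions.append((delta.norm() / (x_out.norm() + 1e-8)).item())

    # model.transformer.h holds the block list in GPT-2-style checkpoints (assumption).
    handles = [blk.register_forward_hook(hook) for blk in model.transformer.h]
    model(input_ids)
    for h in handles:
        h.remove()
    return contributions
```

Taking the median of these values per model gives the kind of per-layer statistic the paper reports; in larger models most entries fall near 1% or below.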
This dynamic layer sparsity can be exploited through a novel neural architecture called Radial Networks. Radial Networks perform token-level routing between layers, guided by a trained router module. This allows them to invoke only a subset of the model layers for each token, reducing the average network depth and lowering model latency. Radial Networks can be used for post-training distillation from sequential networks or trained from scratch to co-learn the router and layer weights.
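A minimal sketch of the routing idea follows. The names (`RadialBlock`, `router`, `threshold`) are illustrative rather than the paper's implementation, and for simplicity the sketch gates layers in sequence instead of reproducing the paper's full routing mechanism; training the hard gate end-to-end would additionally require a straight-through estimator or similar.

```python
import torch
import torch.nn as nn

class RadialBlock(nn.Module):
    """Token-level layer routing sketch: a linear router scores every
    (token, layer) pair, and a token is processed by a layer only when
    its gate clears a threshold, so average depth varies per token."""

    def __init__(self, d_model=512, n_layers=8, threshold=0.5):
        super().__init__()
        self.layers = nn.ModuleList(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True)
            for _ in range(n_layers)
        )
        self.router = nn.Linear(d_model, n_layers)  # per-token score for each layer
        self.threshold = threshold

    def forward(self, x):                       # x: (batch, seq, d_model)
        gates = torch.sigmoid(self.router(x))   # (batch, seq, n_layers)
        for i, layer in enumerate(self.layers):
            g = gates[..., i:i + 1]             # gate for layer i, (batch, seq, 1)
            keep = (g > self.threshold).float() # hard routing decision per token
            # Tokens below threshold pass through unchanged (layer is skipped).
            # A real implementation would gather only the routed tokens before
            # calling the layer; here every token is computed and masked,
            # which keeps the sketch short but forfeits the actual savings.
            x = x + keep * g * (layer(x) - x)
        return x

# Usage: route a batch of 16-token sequences through up to 8 layers.
block = RadialBlock()
out = block(torch.randn(2, 16, 512))
```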
The paper demonstrates that Radial Networks enable scaling to larger model sizes by decoupling the number of layers from the dynamic depth of the network. By varying the compute token by token, they reduce the overall resources needed for generating entire sequences, leading to larger capacity networks with significantly lower compute and serving costs.
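As a back-of-the-envelope illustration of this decoupling (the numbers below are hypothetical, not results from the paper): a network can keep a large layer count for capacity while each token only pays for the layers it actually visits.

```python
total_layers = 48          # layers the network owns (capacity), hypothetical
avg_depth_per_token = 12   # layers a token visits on average, hypothetical

fraction = avg_depth_per_token / total_layers
print(f"compute per token: {fraction:.0%} of a dense sequential pass "
      f"(~{total_layers // avg_depth_per_token}x fewer layer invocations)")
# -> compute per token: 25% of a dense sequential pass (~4x fewer layer invocations)
```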
Key insights distilled from Jordan Dotzel et al., arxiv.org, 2024-04-09
https://arxiv.org/pdf/2404.04900.pdf