Core Concepts
Selectively removing certain Transformer layers or sublayers in large language models can improve inference efficiency without significantly impacting performance.
Abstract
This report explores methods to enhance the inference efficiency of large language models (LLMs) by investigating optimization strategies and architectural innovations. The key insights are:
LLMs continue to grow in size because scaling to larger datasets improves capability. However, this growth in size substantially increases the computational cost of running these models during inference.
The authors hypothesize that dropping certain Transformer layers or sublayers (attention vs. feedforward) in pre-trained LLMs could retain performance while improving inference efficiency. They explore three main approaches:
Dropping full Transformer layers
Selectively dropping attention or feedforward sublayers
Selectively dropping layers based on the similarity between consecutive layer outputs
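To make the second approach concrete, here is a minimal toy sketch (hypothetical, not the authors' implementation) of a Transformer-style forward pass in which the attention sublayers of the latter layers are skipped while their feedforward sublayers are kept. The `attention` and `feedforward` functions are placeholder stand-ins for real sublayers:

```python
def attention(x):
    # Stand-in for a self-attention sublayer: here just a scaled copy.
    return [0.5 * v for v in x]

def feedforward(x):
    # Stand-in for a feedforward sublayer.
    return [v + 1.0 for v in x]

def forward(x, n_layers=8, skip_attn_from=4):
    """Run n_layers; drop attention sublayers in layers >= skip_attn_from."""
    for layer in range(n_layers):
        if layer < skip_attn_from:
            # residual connection + attention sublayer
            x = [a + b for a, b in zip(x, attention(x))]
        # residual connection + feedforward sublayer (always kept)
        x = [a + b for a, b in zip(x, feedforward(x))]
    return x

full = forward([1.0, 2.0], skip_attn_from=8)  # no sublayers skipped
fast = forward([1.0, 2.0], skip_attn_from=4)  # latter attention dropped
```

Skipping latter attention saves both the attention computation and its KV-cache reads, which is where the one-token generation speedup reported below comes from.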
Experiments on benchmarks such as ARC, HellaSwag, TruthfulQA, and MMLU show that skipping the latter attention sublayers yields a 21% speedup in one-token generation for the LLaMA 2 7B model while, surprisingly, improving performance on several of these benchmarks.
The authors conclude that latter attention sublayers in LLMs are often redundant and computationally expensive, suggesting that selective layer removal is a promising approach to improve inference efficiency without significantly impacting model performance.
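The third approach, skipping layers whose outputs barely change their inputs, can be sketched as follows. This is a hypothetical illustration, not the authors' code: a layer is kept only when the cosine similarity between its input and output falls below a threshold, i.e. when the layer does meaningful work; the `select_layers` helper and the toy layer functions are assumptions for illustration.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb)

def select_layers(layer_fns, x, threshold=0.999):
    """Keep a layer only if it changes its input's direction noticeably."""
    kept = []
    for i, fn in enumerate(layer_fns):
        y = fn(x)
        if cosine_similarity(x, y) < threshold:
            kept.append(i)  # layer does real work: keep it
            x = y
        # else: the layer is near-identity; skip it and pass x through
    return kept

layers = [
    lambda v: [u * 1.0001 for u in v],  # near-identity: candidate to drop
    lambda v: [v[1], -v[0]],            # rotates the representation: keep
]
print(select_layers(layers, [1.0, 2.0]))  # → [1]
```

In practice such a selection would be run once on calibration data, after which the skipped layers are simply removed from the inference path.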
Stats
The key quantitative result is the 21% one-token generation speedup for LLaMA 2 7B noted above; beyond that, the findings rest on empirical benchmark experiments and qualitative analysis rather than extensive reported statistics.
Quotes
"We empirically show that latter attention sublayers are not critical for inference in LLMs, and even are a drawback."
"Selectively skipping Transformer layers if the output vectors between consecutive layers are close."