Core Concepts
Tandem Transformers improve inference efficiency by pairing a small autoregressive model with a large model that operates in block mode.
Abstract
The paper introduces Tandem Transformers, a novel architecture for improving the inference efficiency of large language models. It discusses the autoregressive nature of conventional LLMs, the challenges this creates, and how Tandem Transformers address them. The architecture, training process, experiments, and results are detailed, showing the benefits of Tandem Transformers across various scenarios.
Introduction
- Autoregressive, token-by-token decoding limits inference speed.
- Sequential generation makes it difficult to use ML accelerators efficiently.
- Tandem Transformers are introduced to address these limitations.
Tandem Transformer Architecture
- Combination of a small autoregressive model (MS) and a large model (ML) run in block mode.
- Process flow: ML processes blocks of tokens in parallel and provides representations that MS attends to while autoregressively generating the next block of tokens (see the sketch after this list).
- Training procedure and configurations explored.
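To make the process flow concrete, here is a minimal sketch in plain Python. The `large_model` (ML) and `small_model` (MS) objects, their `encode` / `next_token` interfaces, and the block length `gamma` are hypothetical placeholders chosen for illustration; this is not the authors' implementation.

```python
# Illustrative sketch of a tandem generation loop (hypothetical APIs).
# Assumptions: large_model (ML) encodes tokens in parallel and returns
# per-token representations; small_model (MS) generates one token at a
# time while attending to ML's representations; gamma is the block length.

def tandem_generate(large_model, small_model, prompt_ids, gamma, max_new_tokens):
    tokens = list(prompt_ids)
    # ML processes the prompt once, in parallel (block mode).
    ml_reprs = large_model.encode(tokens)

    while len(tokens) - len(prompt_ids) < max_new_tokens:
        # MS autoregressively generates the next gamma tokens,
        # attending to ML's representations of the tokens seen so far.
        for _ in range(gamma):
            next_id = small_model.next_token(tokens, ml_reprs)
            tokens.append(next_id)
            if next_id == small_model.eos_id:
                return tokens
        # ML then processes the newly generated block in one parallel pass,
        # refreshing the representations MS uses for the next block.
        ml_reprs = large_model.encode(tokens)

    return tokens
```

The key point the sketch captures is that the large model runs only once per block of gamma tokens, rather than once per token, which is what allows better utilization of accelerators.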
Experiments and Results
- Performance evaluation on benchmark datasets.
- Latency evaluation within the SPEED framework.
- Standalone evaluation of the Tandem model's performance on downstream tasks.
- Adaptive block length approach for improved efficiency (see the sketch after this list).
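As a rough illustration of the adaptive block length idea, the sketch below lets the small model keep drafting tokens while its own confidence stays above a threshold, after which the large model checks the drafted block in a single parallel pass. The `next_token_with_prob` and `verify` interfaces, the threshold, and the acceptance rule are assumptions made for illustration, not the paper's exact procedure.

```python
# Illustrative sketch of adaptive block length in a speculative-decoding-style
# loop (hypothetical APIs; simplified acceptance rule).

def adaptive_speculative_step(large_model, small_model, tokens,
                              max_draft_len=8, confidence_threshold=0.8):
    draft = []
    # Draft with the small model until its confidence drops or the cap is hit,
    # so the draft length adapts to how "easy" the current text is.
    for _ in range(max_draft_len):
        token_id, prob = small_model.next_token_with_prob(tokens + draft)
        draft.append(token_id)
        if prob < confidence_threshold:
            break

    # The large model scores the whole draft in one parallel pass and accepts
    # the longest prefix it agrees with.
    num_accepted = large_model.verify(tokens, draft)
    return tokens + draft[:num_accepted]
```

The intuition is that a fixed block length wastes work when the small model is unsure (many drafted tokens get rejected) and leaves speedup on the table when it is confident; adapting the draft length to the drafter's confidence trades off between the two.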
Stats
On the PaLM2 pretraining dataset, a Tandem of PaLM2-Bison and PaLM2-Gecko demonstrates a 3.3% improvement in next-token prediction accuracy over a standalone PaLM2-Gecko, offering a 1.16× speedup compared to a PaLM2-Otter model with comparable downstream performance.
Quotes
"The autoregressive nature restricts the full utilization of ML accelerators."
"Tandem Transformers substantially boost the small model’s predictive accuracy."