Core Concepts
High-degree polynomial attention can replace softmax attention without sacrificing model quality, enabling a linear-time Transformer architecture.
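As a rough illustration of the idea (not the paper's exact formulation), the sketch below contrasts standard softmax attention with degree-p polynomial attention, both written naively in quadratic time for clarity; the degree p = 4, the toy sizes, and the function names are assumptions made for this example.

```python
# Illustrative sketch (not the paper's exact formulation): softmax attention vs.
# degree-p polynomial attention, both materializing the n x n score matrix for clarity.
# The degree p = 4 and the toy sizes below are assumptions for the example.
import numpy as np

def softmax_attention(Q, K, V):
    # Standard attention: exponentiate scaled dot products, normalize rows, mix values.
    scores = np.exp(Q @ K.T / np.sqrt(Q.shape[-1]))
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

def polynomial_attention(Q, K, V, p=4):
    # Polynomial attention: replace exp(.) with raising dot products to an even power p,
    # then normalize the same way. An even p keeps the scores non-negative.
    scores = (Q @ K.T) ** p
    weights = scores / scores.sum(axis=-1, keepdims=True)
    return weights @ V

rng = np.random.default_rng(0)
n, d = 8, 16
Q, K, V = rng.normal(size=(3, n, d))
print(softmax_attention(Q, K, V).shape, polynomial_attention(Q, K, V).shape)  # (8, 16) (8, 16)
```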
1. Introduction
Self-attention in Transformers is a computational bottleneck, since its cost grows quadratically with context length.
Efficient-Transformer variants aim to address this scalability issue.
Vanilla Transformers nonetheless dominate in practice, largely because highly optimized implementations give them practical training speedups.
2. Polynomial Attention and Approximation
Kernel-based methods avoid materializing the n × n attention matrix, making attention computation linear rather than quadratic in sequence length.
Approximate (sketched) feature maps for polynomial kernels are explored, keeping the feature dimension manageable as the polynomial degree grows.
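As a minimal, non-causal sketch of the kernel view (assuming the degree-2 polynomial kernel, where the explicit feature map phi(x) = x ⊗ x is exact; function names and sizes are illustrative), the example below computes polynomial attention without ever forming the n × n matrix. Higher degrees blow up the d^p feature dimension, which is what sketched/approximate feature maps are meant to control.

```python
# Sketch: linear-time (in sequence length) polynomial attention via an explicit
# degree-2 feature map phi(x) = vec(x x^T). Exact for p = 2; for higher degrees the
# d^p feature dimension explodes, motivating sketched/approximate feature maps.
# Function names and sizes are illustrative assumptions.
import numpy as np

def phi_deg2(X):
    # Map each row x in R^d to vec(x x^T) in R^(d^2), so phi(q).phi(k) = (q.k)^2.
    n, d = X.shape
    return np.einsum('ni,nj->nij', X, X).reshape(n, d * d)

def linear_poly_attention(Q, K, V):
    # Non-causal degree-2 polynomial attention without the n x n score matrix:
    #   numerator   = phi(Q) @ (phi(K)^T V)
    #   denominator = phi(Q) @ (phi(K)^T 1)
    Qf, Kf = phi_deg2(Q), phi_deg2(K)
    num = Qf @ (Kf.T @ V)
    den = Qf @ Kf.sum(axis=0)
    return num / den[:, None]

def quadratic_poly_attention(Q, K, V):
    # Reference implementation that materializes the n x n score matrix.
    scores = (Q @ K.T) ** 2
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(0)
n, d = 128, 8
Q, K, V = rng.normal(size=(3, n, d))
assert np.allclose(linear_poly_attention(Q, K, V), quadratic_poly_attention(Q, K, V))
```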
3. Dealing with Causal Masks
Block-based lower-triangular multiplication handles causal masks efficiently, preserving near-linear compute in decoder-style (autoregressive) settings.
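The sketch below illustrates one way such block-based causal computation can work, again assuming the exact degree-2 feature map; the block size, helper names, and blocking details are assumptions and may differ from the paper's scheme.

```python
# Sketch of block-based causal (lower-triangular) linear attention, assuming the
# degree-2 feature map phi from the previous snippet. Block size and helper names
# are illustrative assumptions, not the paper's exact algorithm.
import numpy as np

def phi_deg2(X):
    n, d = X.shape
    return np.einsum('ni,nj->nij', X, X).reshape(n, d * d)

def blockwise_causal_poly_attention(Q, K, V, block=32):
    n, d_v = Q.shape[0], V.shape[1]
    Qf, Kf = phi_deg2(Q), phi_deg2(K)
    D = Qf.shape[1]
    num = np.zeros((n, d_v))
    den = np.zeros(n)
    kv_state = np.zeros((D, d_v))   # running sum of phi(k_j) v_j^T over past blocks
    k_state = np.zeros(D)           # running sum of phi(k_j) over past blocks
    for start in range(0, n, block):
        end = min(start + block, n)
        Qb, Kb, Vb = Qf[start:end], Kf[start:end], V[start:end]
        # Contribution of all fully-past blocks: dense matmuls with the running state.
        num[start:end] = Qb @ kv_state
        den[start:end] = Qb @ k_state
        # Within-block causal part: a small (block x block) lower-triangular product.
        scores = np.tril(Qb @ Kb.T)
        num[start:end] += scores @ Vb
        den[start:end] += scores.sum(axis=1)
        # Fold the current block into the running state for future blocks.
        kv_state += Kb.T @ Vb
        k_state += Kb.sum(axis=0)
    return num / den[:, None]

def causal_reference(Q, K, V):
    # Quadratic-time reference: masked degree-2 polynomial attention.
    scores = np.tril((Q @ K.T) ** 2)
    return (scores / scores.sum(axis=-1, keepdims=True)) @ V

rng = np.random.default_rng(1)
n, d = 96, 8
Q, K, V = rng.normal(size=(3, n, d))
assert np.allclose(blockwise_causal_poly_attention(Q, K, V), causal_reference(Q, K, V))
```

Only the small block-local products touch a lower-triangular mask; everything across blocks is a running-state matmul, so total work stays linear in sequence length up to the block-size factor.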
4. Experiments
Synthetic tasks measure content-aware reasoning and memorization capabilities.
Real-world datasets are used to train decoder-only models at scales mirroring the GPT-2 family.
Stats
Figure: train-step latency per token (µs/token) versus context length for GPT-2-small-style models with different attention mechanisms: Vanilla Softmax, Polysketch, FlashAttention (block = 256), and FlashAttention (block = 512).