Basic concepts
Because a transformer with RoPE adds no positional information to the input embeddings, the input to the first layer depends only on the token ID. The first layer's Q, K, and V projections (and, in architectures that run attention and the FFN in parallel, the FFN as well) can therefore be precomputed once for every vocabulary entry and replaced by a table lookup at inference, lowering latency and cost-per-token.
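To make this concrete, here is a minimal NumPy sketch; all names, shapes, and the half-split RoPE variant are illustrative assumptions, not details from the original. Layer-1 Q/K/V are tabulated once for the whole vocabulary, and the position-dependent rotation is applied after the lookup:

```python
import numpy as np

def precompute_first_layer(E, Wq, Wk, Wv):
    """Tabulate layer-1 Q/K/V for every vocabulary entry.

    E:          (V, d) token embedding matrix
    Wq, Wk, Wv: (d, d) layer-1 projection weights
    Returns three (V, d) lookup tables.
    """
    # With RoPE there is no additive positional embedding, so the
    # layer-1 input for token t is exactly E[t]; the projections
    # depend only on the token ID and can be computed once, offline.
    return E @ Wq, E @ Wk, E @ Wv

def rope(x, pos, base=10000.0):
    """Apply a half-split (GPT-NeoX-style) rotary embedding to x at `pos`."""
    half = x.shape[-1] // 2
    freqs = base ** (-np.arange(half) / half)   # one frequency per pair
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    x1, x2 = x[:half], x[half:]
    # Rotate each (x1[i], x2[i]) pair by its position-dependent angle.
    return np.concatenate([x1 * cos - x2 * sin, x1 * sin + x2 * cos])

# Usage: at inference, layer-1 QKV is a table lookup plus a cheap rotation.
V, d = 32000, 64
rng = np.random.default_rng(0)
E, Wq, Wk, Wv = (rng.standard_normal(s) * 0.02
                 for s in [(V, d), (d, d), (d, d), (d, d)])
Q_tab, K_tab, V_tab = precompute_first_layer(E, Wq, Wk, Wv)

token_id, pos = 1234, 7
q = rope(Q_tab[token_id], pos)   # RoPE is applied after the lookup
k = rope(K_tab[token_id], pos)
v = V_tab[token_id]              # values are not rotated
```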
Stats
Because the technique removes at most one layer's worth of work, the maximum savings is roughly 1/num_layers. For example, a model with only 4 layers (such as Whisper tiny) can save at most 25%, while a 32-layer model can save only about 3%.
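As a quick check of this bound (the helper name is illustrative):

```python
def max_savings(num_layers: int) -> float:
    """Upper bound on compute saved by skipping one of `num_layers` layers."""
    return 1.0 / num_layers

print(f"{max_savings(4):.0%}")   # 25% for a 4-layer model (Whisper tiny)
print(f"{max_savings(32):.1%}")  # 3.1% for a 32-layer model
```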
Reads per batch: B · d + num_weights_Q_K_V_FFN, where B is the batch size, d is the model dimension, and num_weights_Q_K_V_FFN is the combined size of the layer's Q, K, V, and FFN weight matrices. In other words, computing the first layer normally fetches the input activations plus all of that layer's weights.
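The accounting can be sketched as below; the concrete numbers and the four-table precomputed variant (Q, K, V, and a parallel-FFN output, each d wide) are assumptions for illustration, not figures from the original:

```python
def reads_baseline(B: int, d: int, num_weights_qkv_ffn: int) -> int:
    """Reads per batch when layer 1 is computed normally: the (B, d)
    input activations plus every Q/K/V/FFN weight, fetched once."""
    return B * d + num_weights_qkv_ffn

def reads_precomputed(B: int, d: int, tables: int = 4) -> int:
    """Hypothetical precomputed variant: per token, read one row from
    each lookup table instead of reading any weights."""
    return B * tables * d

# Illustrative numbers: a 4096-dim model with roughly 4*d^2 of
# attention-projection weights and 8*d^2 of FFN weights.
B, d = 8, 4096
w = 4 * d * d + 8 * d * d
print(f"baseline:    {reads_baseline(B, d, w):>13,} reads/batch")
print(f"precomputed: {reads_precomputed(B, d):>13,} reads/batch")
# At small batch sizes the weight reads dominate the baseline, so the
# lookup variant moves far less data for this one layer.
```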