FlattenQuant, a per-tensor quantization method, improves inference efficiency for large language models (LLMs) by flattening activation channels with large outliers into several lower-magnitude channels, so that a single quantization scale per tensor suffices; this enables low-bit integer matrix multiplication and reduces memory consumption and latency.
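A minimal NumPy sketch of the idea, not the paper's implementation: it assumes a simplified flattening rule (any channel whose max absolute value exceeds a threshold is split into k equal copies scaled by 1/k, with the matching weight row duplicated so the matmul is preserved) followed by symmetric 8-bit per-tensor quantization; the function names and the threshold value are illustrative.

```python
import numpy as np

def flatten_outlier_channels(x, w, threshold):
    """Split each activation channel whose max |value| exceeds `threshold`
    into k equal copies scaled by 1/k, duplicating the matching weight row
    so that x @ w is preserved exactly (illustrative flattening rule)."""
    new_cols, new_rows = [], []
    for c in range(x.shape[1]):
        col, row = x[:, c], w[c, :]
        k = max(1, int(np.ceil(np.abs(col).max() / threshold)))
        for _ in range(k):
            new_cols.append(col / k)   # each copy carries 1/k of the magnitude
            new_rows.append(row)       # k identical rows: their sum restores x @ w
    return np.stack(new_cols, axis=1), np.stack(new_rows, axis=0)

def quantize_per_tensor(t, n_bits=8):
    """Symmetric quantization with a single scale for the whole tensor."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(t).max() / qmax
    q = np.round(t / scale).clip(-qmax - 1, qmax).astype(np.int8)
    return q, scale

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 8))
x[:, 0] *= 30                               # one outlier channel dominates the scale
w = rng.normal(size=(8, 16))

xf, wf = flatten_outlier_channels(x, w, threshold=3.0)
qx, sx = quantize_per_tensor(xf)
qw, sw = quantize_per_tensor(wf)
y = (qx.astype(np.int32) @ qw.astype(np.int32)) * (sx * sw)  # int matmul + rescale
print("max error after flattening:", np.abs(y - x @ w).max())
```

Without the flattening step, the outlier channel would force a large per-tensor scale and crush the resolution of every other channel; splitting it lets one shared scale fit the whole tensor.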
Clustered Head Attention (CHAI) reduces memory and compute requirements in LLMs by grouping highly correlated attention heads and computing attention only once per cluster.
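A minimal NumPy sketch under stated assumptions, not CHAI's actual pipeline: heads are grouped offline by the correlation of their calibration attention maps via a simple greedy threshold rule, and at inference one score matrix is computed per cluster and reused against each member head's own values; the clustering rule, threshold, and all names are illustrative simplifications.

```python
import numpy as np

def attention_maps(q, k):
    """Per-head softmax attention maps; used here as calibration data."""
    d = q.shape[-1]
    s = q @ k.transpose(0, 2, 1) / np.sqrt(d)
    s = np.exp(s - s.max(-1, keepdims=True))
    return s / s.sum(-1, keepdims=True)

def cluster_heads(maps, tau=0.9):
    """Greedily group heads whose calibration attention maps correlate
    above `tau`; rep_of[h] is the representative head of h's cluster."""
    n_heads = maps.shape[0]
    corr = np.corrcoef(maps.reshape(n_heads, -1))
    rep_of = np.full(n_heads, -1)
    for h in range(n_heads):
        if rep_of[h] == -1:            # h starts a new cluster
            rep_of[(rep_of == -1) & (corr[h] >= tau)] = h
    return rep_of

def clustered_attention(q, k, v, rep_of):
    """Compute attention scores only for representative heads, then
    reuse the shared scores against every member head's own values."""
    out = np.empty_like(v)
    shared = {}
    for h in range(q.shape[0]):
        r = rep_of[h]
        if r not in shared:                        # one score matrix per cluster
            shared[r] = attention_maps(q[r:r + 1], k[r:r + 1])[0]
        out[h] = shared[r] @ v[h]                  # per-head values are kept
    return out

rng = np.random.default_rng(0)
n_heads, seq, d = 8, 16, 32
base_q = rng.normal(size=(2, seq, d))              # two underlying head behaviors
base_k = rng.normal(size=(2, seq, d))
q = np.repeat(base_q, 4, axis=0) + 0.01 * rng.normal(size=(n_heads, seq, d))
k = np.repeat(base_k, 4, axis=0) + 0.01 * rng.normal(size=(n_heads, seq, d))
v = rng.normal(size=(n_heads, seq, d))

rep_of = cluster_heads(attention_maps(q, k))
print("cluster representatives:", rep_of)          # expect two groups of four
print("output shape:", clustered_attention(q, k, v, rep_of).shape)
```

Because clustered heads share one score matrix, the keys (and score computation) for non-representative heads can be dropped, which is where the memory and compute savings come from.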