Analog In-Memory Computing Attention Mechanism for Fast and Energy-Efficient Large Language Model Inference
A novel analog in-memory computing architecture based on gain-cell memories performs the attention computations of large language models with significantly lower latency and energy consumption than GPUs.
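For orientation, the core operation such an architecture accelerates is scaled dot-product attention, whose two matrix products (query-key scores and score-value mixing) map naturally onto analog in-memory multiply-accumulate arrays. The sketch below is a plain NumPy reference of that digital baseline, not the paper's analog implementation; all function names are illustrative.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention. The Q @ K.T and weights @ V matrix
    # products are the multiply-accumulate workloads that an analog
    # gain-cell array would compute in place of a GPU.
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)   # (n_queries, n_keys)
    weights = softmax(scores)       # rows sum to 1
    return weights @ V              # (n_queries, d_value)
```

With a single key, the attention weights collapse to 1 and the output equals the value vector, which makes the function easy to sanity-check.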