Efficient Streaming Deployment of Large Language Models with Attention Sinks
Introducing StreamingLLM, an efficient framework that enables large language models trained with a finite attention window to generalize to text of effectively unlimited length, without any fine-tuning, by leveraging attention sinks.
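The core idea can be sketched as a KV-cache eviction policy: always retain the first few tokens (the attention sinks) together with a rolling window of the most recent tokens, discarding everything in between. The function below is a minimal illustrative sketch of such a policy; the function name and the default sink/window sizes are assumptions for illustration, not the framework's actual API.

```python
def streaming_kv_indices(seq_len: int, n_sink: int = 4, window: int = 1020) -> list[int]:
    """Return the KV-cache positions to keep under a sink-plus-window policy.

    Keeps the first `n_sink` attention-sink tokens plus the `window` most
    recent tokens, so the cache never grows beyond n_sink + window entries.
    (Illustrative sketch only; sizes are hypothetical defaults.)
    """
    if seq_len <= n_sink + window:
        # Cache still fits entirely; nothing to evict.
        return list(range(seq_len))
    # Attention sinks at the front, then the most recent window of tokens.
    return list(range(n_sink)) + list(range(seq_len - window, seq_len))
```

For example, with 2 sink tokens and a window of 4, a 10-token sequence keeps positions `[0, 1, 6, 7, 8, 9]`; the cache size stays constant no matter how long generation runs.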