Key Concepts
This work introduces an efficient attention mechanism called Infini-attention that enables Transformer-based Large Language Models (LLMs) to effectively process infinitely long inputs with bounded memory and computation.
Summary
The paper introduces a novel attention mechanism called Infini-attention that enables Transformer-based LLMs to efficiently process infinitely long input sequences.
Key highlights:
- Infini-attention incorporates a compressive memory into the standard attention mechanism, allowing it to retain a compressed summary of the entire context history within bounded memory.
- It combines masked local attention and long-term linear attention within a single Transformer block, modeling both short-range and long-range dependencies (a sketch follows this list).
- The authors demonstrate the effectiveness of their approach on long-context language modeling, 1M sequence length passkey retrieval, and 500K length book summarization tasks.
- Infini-Transformer models, which use Infini-attention, outperform baseline models while adding minimal extra parameters and enabling fast streaming inference.
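To make the combined mechanism concrete, the following is a minimal single-head PyTorch sketch of how one segment could be processed with masked local attention plus a compressive-memory (linear attention) path. It is an illustrative approximation rather than the paper's implementation: the function and variable names (`infini_attention_segment`, `memory`, `z`, `beta`), the ELU+1 feature map, the simple additive memory update (the paper also describes a delta-rule variant), and the single-head, unbatched shapes are all simplifying assumptions.

```python
import torch
import torch.nn.functional as F

def elu_plus_one(x):
    # Non-negative feature map applied to queries/keys on the linear-attention (memory) path.
    return F.elu(x) + 1.0

def infini_attention_segment(q, k, v, memory, z, beta):
    """Single-head sketch of one Infini-attention segment (illustrative, unbatched).

    q, k, v : (seg_len, d) projections for the current segment
    memory  : (d, d) compressive memory carried over from earlier segments
    z       : (d,) normalization term carried over from earlier segments
    beta    : learned scalar gate mixing the memory and local outputs
    """
    seg_len, d = q.shape

    # Local causal (masked) dot-product attention over the current segment only.
    scores = q @ k.T / d**0.5
    causal_mask = torch.triu(torch.ones(seg_len, seg_len), diagonal=1).bool()
    scores = scores.masked_fill(causal_mask, float("-inf"))
    local_out = torch.softmax(scores, dim=-1) @ v

    # Retrieve long-range context from the compressive memory via linear attention.
    sigma_q = elu_plus_one(q)                                   # (seg_len, d)
    mem_out = (sigma_q @ memory) / (sigma_q @ z).clamp(min=1e-6).unsqueeze(-1)

    # Fold the current segment's keys/values into the fixed-size memory.
    sigma_k = elu_plus_one(k)                                   # (seg_len, d)
    new_memory = memory + sigma_k.T @ v                         # stays (d, d)
    new_z = z + sigma_k.sum(dim=0)                              # stays (d,)

    # Learned gate blends the long-term (memory) and local attention outputs.
    gate = torch.sigmoid(beta)
    out = gate * mem_out + (1.0 - gate) * local_out
    return out, new_memory, new_z
```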
The paper makes the following key contributions:
- Introduces a practical yet powerful attention mechanism, Infini-attention, that efficiently models both long- and short-range contextual dependencies.
- Infini-attention requires minimal changes to the standard attention layer, enabling plug-and-play continual pre-training and long-context adaptation.
- The approach allows Transformer LLMs to scale to infinitely long contexts with bounded memory and compute resources by processing inputs in a streaming fashion.
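The bounded-memory claim can be illustrated with a small streaming driver: a long input is split into segments, and only the fixed-size memory matrix and normalizer are carried across segments, so state does not grow with input length. This sketch assumes the hypothetical `infini_attention_segment` helper above; the projection matrices, segment length, and toy dimensions are made up for illustration.

```python
import torch

def stream_infini_attention(x, w_q, w_k, w_v, beta, segment_len=512):
    """Process an arbitrarily long (seq_len, d_model) input segment by segment.

    Only the (d, d) compressive memory and the (d,) normalizer persist between
    segments, so memory use stays constant regardless of total sequence length.
    """
    d = w_q.shape[1]
    memory = torch.zeros(d, d)   # fixed-size compressive memory
    z = torch.zeros(d)           # fixed-size normalization term
    outputs = []

    for start in range(0, x.shape[0], segment_len):
        seg = x[start:start + segment_len]
        q, k, v = seg @ w_q, seg @ w_k, seg @ w_v   # per-segment projections
        out, memory, z = infini_attention_segment(q, k, v, memory, z, beta)
        outputs.append(out)

    return torch.cat(outputs, dim=0)

# Toy usage with hypothetical sizes: a 10,000-token input processed in 512-token segments.
d_model, d = 64, 64
x = torch.randn(10_000, d_model)
w_q, w_k, w_v = (torch.randn(d_model, d) * 0.02 for _ in range(3))
beta = torch.tensor(0.0)
y = stream_infini_attention(x, w_q, w_k, w_v, beta)
print(y.shape)  # torch.Size([10000, 64])
```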
Example: Passkey Retrieval Prompt
The passkey retrieval task hides a short key inside long stretches of distractor text, for example:
The grass is green. The sky is blue. The sun is yellow. The pass key is 9054.
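As a rough illustration of how such an evaluation prompt might be assembled from these pieces (a hypothetical sketch, not the paper's exact prompt template; `build_passkey_prompt` and its parameters are made up):

```python
import random

FILLER = "The grass is green. The sky is blue. The sun is yellow. "

def build_passkey_prompt(passkey: int, n_filler: int = 1000, seed: int = 0) -> str:
    """Hide a passkey sentence at a random position inside repeated distractor text."""
    random.seed(seed)
    key_sentence = f"The pass key is {passkey}. Remember it. "
    parts = [FILLER] * n_filler
    parts.insert(random.randrange(n_filler), key_sentence)
    return "".join(parts) + "What is the pass key?"

prompt = build_passkey_prompt(9054)
print(len(prompt), "characters; contains key:", str(9054) in prompt)
```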