The paper proposes IceFormer, a new method for improving the inference-time efficiency of pretrained Transformers on CPUs. Key highlights:
IceFormer does not require retraining the model, can be applied to a variety of Transformer-based models, and achieves high accuracy and fast inference.
IceFormer addresses the quadratic time and space complexity of the self-attention mechanism in Transformers by using a sparse attention mechanism: for each query, it identifies only the most important keys via k-nearest-neighbor search in the key embedding space (see the sketch after these highlights).
Experiments on the Long Range Arena (LRA) benchmark show that IceFormer achieves a 7.63x speedup on average compared to the vanilla Transformer while retaining 98.6% of its accuracy.
On the ZeroSCROLLS benchmark for large language models (LLMs), IceFormer achieves a 2.73x speedup on average when applied to a leading LLaMA 2-based LLM, while retaining 99.6% of its accuracy.
IceFormer also demonstrates superior scalability on the LongEval benchmark, maintaining its efficiency advantage over the vanilla Transformer as the input sequence length increases.
Overall, IceFormer provides an effective solution for accelerating the inference of long-sequence Transformers on CPUs without the need for retraining, making it well-suited for deploying LLMs on commodity hardware.
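The key-selection idea can be illustrated with a minimal sketch: instead of scoring every key for every query, a k-NN-based sparse attention only softmaxes over the k keys most similar to each query. The function name, the brute-force top-k selection, and the toy sizes below are illustrative assumptions, not the paper's implementation, which would rely on an efficient approximate k-NN index rather than exhaustive search.

```python
import numpy as np

def knn_sparse_attention(Q, K, V, k):
    """Sketch of k-NN sparse attention: each query attends only to its
    k most similar keys (by scaled dot product) instead of all keys.
    Brute-force top-k stands in for an approximate k-NN index here."""
    n_q, d = Q.shape
    out = np.zeros((n_q, V.shape[1]))
    scale = 1.0 / np.sqrt(d)
    for i, q in enumerate(Q):
        scores = (K @ q) * scale                  # similarity of query i to every key
        top = np.argpartition(scores, -k)[-k:]    # indices of the k most relevant keys
        w = np.exp(scores[top] - scores[top].max())
        w /= w.sum()                              # softmax over the selected keys only
        out[i] = w @ V[top]                       # weighted sum of the selected values
    return out

# Toy usage: 512 queries/keys of dimension 64, attending to the top 16 keys per query.
rng = np.random.default_rng(0)
Q, K, V = (rng.standard_normal((512, 64)) for _ in range(3))
print(knn_sparse_attention(Q, K, V, k=16).shape)  # (512, 64)
```

Because only k keys are scored inside the softmax per query, the attention cost grows with k rather than with the full sequence length, which is the source of the speedups reported above once an efficient k-NN search replaces the brute-force loop.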
Source: Yuzhen Mao et al., arxiv.org, 2024-05-07. https://arxiv.org/pdf/2405.02842.pdf