Wu, P., Liu, J., Gong, Z., Wang, Q., Li, J., Wang, J., Cai, X., & Zhao, D. (2024). FIRP: Faster LLM inference via future intermediate representation prediction. arXiv preprint arXiv:2410.20488.
This paper introduces FIRP, a novel approach to accelerate the inference speed of Large Language Models (LLMs) by predicting the intermediate hidden states of future tokens during decoding.
FIRP employs a trainable linear projection to predict the hidden states of future tokens in intermediate layers of the LLM. These predicted hidden states are then fed through subsequent layers, allowing them to interact with the context and refine their representations. Finally, the original language model head is used to decode the draft tokens from the predicted hidden states. The method utilizes a tree attention mechanism to verify multiple draft sequences simultaneously, further enhancing efficiency.
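The three-step flow described above (linear prediction of a future hidden state, refinement through the remaining layers, decoding with the unchanged LM head) can be sketched as follows. This is a minimal toy illustration, not the paper's implementation: the dimensions, the random projection weights, and the `upper_layers` stand-in are all hypothetical, and tree-attention verification is omitted.

```python
import numpy as np

rng = np.random.default_rng(0)

d_model, vocab = 16, 32  # toy dimensions (hypothetical)
W_proj = rng.normal(size=(d_model, d_model)) * 0.1  # trainable linear projection (random here)
lm_head = rng.normal(size=(d_model, vocab))         # original, frozen LM head

def upper_layers(h):
    """Stand-in for the transformer layers above the intermediate layer k.
    In FIRP, the predicted state attends to the context here and is refined."""
    return np.tanh(h)

# h_k: hidden state of the last accepted token at intermediate layer k
h_k = rng.normal(size=d_model)

# 1) Predict the next token's hidden state at layer k with a linear map
h_next_pred = h_k @ W_proj

# 2) Refine the predicted state by passing it through the subsequent layers
h_next_refined = upper_layers(h_next_pred)

# 3) Decode the draft token from the refined state with the original LM head
draft_token = int(np.argmax(h_next_refined @ lm_head))
```

In the actual method, `W_proj` is trained so the predicted state matches the true hidden state of the future token, and several such draft tokens are verified in one forward pass.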
FIRP offers a promising solution for accelerating LLM inference without compromising generation quality. By predicting and refining intermediate hidden states, FIRP enables the generation of multiple tokens in a single forward pass, effectively leveraging the parallel processing capabilities of modern hardware.
This research contributes to ongoing efforts to optimize LLM inference, addressing a critical latency bottleneck in deploying these models for real-world applications. By reducing per-token decoding cost without retraining the base model, the method is relevant to any domain that depends on fast, high-quality text generation.
The paper primarily focuses on greedy decoding and could be extended by exploring the effectiveness of FIRP with other decoding strategies like beam search. Further investigation into optimizing the selection of prediction layers and exploring different architectures for hidden state prediction could yield additional performance improvements.