The paper proposes a novel parallel decoding approach called Hidden Transfer to accelerate the inference of large language models (LLMs). The key idea is to predict the pseudo hidden states of future tokens at intermediate layers and then pass them through the subsequent transformer layers, so that multiple draft tokens can be generated in a single forward pass.
The authors first analyze the inefficiency of traditional autoregressive decoding, which generates one token per forward pass and fails to fully utilize the parallel computing capabilities of GPUs. They then introduce the Hidden Transfer method, which trains linear projections that map the hidden states of the current tokens at selected intermediate layers to pseudo hidden states for future tokens. These pseudo hidden states are then passed through the remaining transformer layers, allowing the model to predict multiple draft tokens simultaneously.
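A minimal PyTorch-style sketch of this mechanism follows; the function and parameter names are illustrative assumptions rather than the paper's actual implementation. At a chosen transfer layer, the hidden state of the last real token is mapped by trained linear projections into pseudo hidden states for the next k positions, which are appended to the sequence and refined by the remaining layers before the LM head emits draft tokens.

```python
import torch

def hidden_transfer_forward(layers, lm_head, embeds, transfer_idx, projections):
    """Illustrative sketch of hidden transfer (names are assumptions, not the
    paper's code). `layers` is the list of transformer blocks, `lm_head` the
    output projection, `embeds` the input embeddings of shape [batch, seq, dim],
    `transfer_idx` the intermediate layer after which pseudo states are injected,
    and `projections` a list of k trained linear modules, one per future token.
    """
    h = embeds
    # Run the lower layers as usual on the real tokens.
    for layer in layers[:transfer_idx]:
        h = layer(h)

    # Predict pseudo hidden states for the next k positions from the last
    # real token's hidden state, then append them to the sequence.
    last = h[:, -1, :]                                                  # [batch, dim]
    pseudo = torch.stack([proj(last) for proj in projections], dim=1)  # [batch, k, dim]
    h = torch.cat([h, pseudo], dim=1)                                  # [batch, seq + k, dim]

    # The remaining layers refine real and pseudo states jointly.
    for layer in layers[transfer_idx:]:
        h = layer(h)

    logits = lm_head(h)
    k = len(projections)
    next_token = logits[:, -1 - k, :].argmax(dim=-1)   # regular next-token prediction
    draft_tokens = logits[:, -k:, :].argmax(dim=-1)    # k draft tokens from pseudo states
    return next_token, draft_tokens
```

Attention masking and KV caching are omitted here for brevity; in practice the appended pseudo positions attend to the preceding real tokens like ordinary sequence positions.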
The authors conduct extensive experiments to evaluate the effectiveness of their approach. They compare the acceleration ratio of Hidden Transfer with other single-model acceleration techniques, such as Medusa and Self-Speculative Decoding, and demonstrate that Hidden Transfer outperforms these methods in terms of end-to-end time acceleration. Additionally, the authors perform analytical experiments to verify their motivation, showing that the predicted pseudo hidden states are progressively refined through the subsequent transformer layers, gaining more semantic information and improving the accuracy of the draft token predictions.
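The end-to-end speedup ultimately comes from a draft-and-verify style loop: the draft tokens are checked against the model's own predictions in a single parallel forward pass, and the longest matching prefix is accepted. A hedged sketch of such a loop under greedy decoding is shown below; `draft_step` and `verify` are placeholder names I am assuming, not functions from the paper.

```python
def decode_with_drafts(model, input_ids, max_new_tokens):
    """Greedy draft-and-verify loop (illustrative; method names are assumptions).
    `model.draft_step` is assumed to return the regular next token plus k draft
    tokens via hidden transfer; `model.verify` is assumed to return the model's
    greedy prediction after each prefix of the candidate continuation.
    """
    generated = list(input_ids)
    while len(generated) - len(input_ids) < max_new_tokens:
        next_token, drafts = model.draft_step(generated)   # one forward pass
        candidate = [next_token] + list(drafts)
        # One forward pass scores every candidate position in parallel;
        # greedy[i] is the model's choice for the token after candidate[:i+1].
        greedy = model.verify(generated, candidate)
        accepted = [next_token]          # the regular next token is always correct
        for draft, ref in zip(drafts, greedy):
            if draft == ref:
                accepted.append(draft)   # draft matches the model's own prediction
            else:
                break                    # stop at the first mismatch
        generated.extend(accepted)
    return generated
```

Whenever several drafts are accepted, multiple tokens are committed per verification pass, which is the source of the wall-clock acceleration reported in the experiments.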
The paper also discusses the trade-offs in selecting appropriate transfer layers and the potential limitations of the approach, such as the increased compute and memory requirements caused by the expanded input sequence during both training and inference.
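To make that overhead concrete, a rough back-of-the-envelope estimate (my own approximation, not a figure from the paper): appending k pseudo positions at transfer layer l of an L-layer model adds roughly k·(L−l)/L extra full-depth token positions of compute per forward pass.

```python
def relative_extra_compute(num_layers, transfer_layer, num_drafts):
    """Rough estimate (an assumption, not a number from the paper) of the extra
    per-step compute from hidden transfer, in equivalent full-depth token
    positions: the k pseudo positions only traverse the layers above the
    transfer layer.
    """
    return num_drafts * (num_layers - transfer_layer) / num_layers

# Example: a 32-layer model, transfer after layer 24, 3 draft tokens
# -> 3 * (32 - 24) / 32 = 0.75 extra token positions per forward pass.
print(relative_extra_compute(32, 24, 3))
```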
Key insights distilled from: Pengfei Wu, J... et al., arxiv.org, 2024-04-19, https://arxiv.org/pdf/2404.12022.pdf