Parallel Decoding with Hidden Transfer: An Efficient Approach for Accelerating Large Language Model Inference
Hidden Transfer is a novel parallel decoding approach that predicts the pseudo hidden states of future tokens in intermediate layers, enabling the generation of multiple tokens in a single forward pass and thereby improving the inference efficiency of large language models.
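The mechanism described above can be illustrated with a toy sketch. This is not the paper's implementation: the layer split point, the per-offset "transfer" matrices that map the last real hidden state to pseudo hidden states, and all dimensions are hypothetical, and the transformer layers are stand-in random linear maps.

```python
import numpy as np

rng = np.random.default_rng(0)

D, V, L_TOTAL, L_SPLIT, K = 16, 50, 4, 2, 2  # hidden dim, vocab, layers, split layer, draft tokens

# Stand-in "transformer layers": random linear maps with a nonlinearity.
layers = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(L_TOTAL)]
# Hypothetical transfer heads: one matrix per future offset, mapping the
# last real hidden state to a pseudo hidden state for token t+1+k.
transfer = [rng.standard_normal((D, D)) / np.sqrt(D) for _ in range(K)]
lm_head = rng.standard_normal((D, V)) / np.sqrt(D)

def block(h, W):
    return np.tanh(h @ W)

def decode_parallel(h0):
    """One forward pass that emits 1 + K tokens via hidden transfer."""
    h = h0
    for W in layers[:L_SPLIT]:          # lower layers: real prefix only
        h = block(h, W)
    last = h[-1]                        # intermediate state of last real token
    # Predict pseudo hidden states for K future tokens at this layer.
    pseudo = np.stack([block(last, T) for T in transfer])
    h = np.concatenate([h, pseudo], axis=0)
    for W in layers[L_SPLIT:]:          # upper layers refine real + pseudo jointly
        h = block(h, W)
    logits = h[-(K + 1):] @ lm_head     # next token plus K drafted tokens
    return logits.argmax(axis=-1)       # (K+1,) token ids from a single pass

prefix = rng.standard_normal((3, D))    # a 3-token prefix of hidden states
tokens = decode_parallel(prefix)
print(tokens.shape)  # (K + 1,) tokens from one forward pass
```

In a real system the drafted tokens would then be verified (e.g. accepted or rejected against the model's own next-token distribution, as in speculative decoding) so that output quality matches standard autoregressive decoding.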