Core Concepts
Dynamically selecting the most suitable smaller "draft" language model to guide a larger language model's text generation, based on the input query, can significantly improve inference speed without sacrificing output quality.
Summary
Bibliographic Information:
Huang, J., Parthasarathi, P., Rezagholizadeh, M., & Chandar, S. (2024). Context-Aware Assistant Selection for Improved Inference Acceleration with Large Language Models. arXiv preprint arXiv:2408.08470.
Research Objective:
This research paper investigates how to improve the inference speed of large language models (LLMs) by dynamically selecting the most appropriate smaller "draft" model for assisted decoding, based on the input query.
Methodology:
The authors frame the problem as a contextual bandit, where each draft model is an "arm" and the reward is the inference speed-up achieved. They collect an offline dataset of (query, draft model, reward) tuples by measuring the alignment between draft and target model outputs on a set of training examples. This dataset is then used to train a policy network with the REINFORCE algorithm to select the best draft model for a given query at inference time.
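The sketch below illustrates this setup under simplifying assumptions; it is not the authors' code, and the names (DraftSelectionPolicy, reinforce_step) are hypothetical. It assumes queries are already embedded into fixed-size vectors and that the logged reward is the speed-up measured for the draft model that was evaluated; the paper's exact training procedure may differ.

```python
# A minimal sketch of offline contextual-bandit training with REINFORCE:
# the policy scores each draft model ("arm") from a query embedding, and
# the reward is the measured speed-up over auto-regressive decoding.
import torch
import torch.nn as nn

class DraftSelectionPolicy(nn.Module):
    def __init__(self, embed_dim: int, num_arms: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(embed_dim, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_arms),  # one logit per arm (draft model)
        )

    def forward(self, query_embedding: torch.Tensor) -> torch.Tensor:
        return self.net(query_embedding)

def reinforce_step(policy, optimizer, query_emb, arm, reward, baseline=0.0):
    """One REINFORCE-style update on a batch of offline (query, arm, reward) tuples.

    query_emb: (B, D) query embeddings
    arm:       (B,)   index of the draft model that was measured (long tensor)
    reward:    (B,)   observed speed-up relative to auto-regressive decoding
    """
    logits = policy(query_emb)
    log_probs = torch.log_softmax(logits, dim=-1)
    chosen_log_prob = log_probs.gather(1, arm.unsqueeze(1)).squeeze(1)
    # Policy-gradient loss: maximize expected (reward - baseline).
    loss = -((reward - baseline) * chosen_log_prob).mean()
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```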
Key Findings:
- Using a policy to dynamically select among multiple draft models, each specialized in different domains, leads to consistent inference acceleration across all domains, outperforming the use of any single draft model.
- The policy effectively learns to balance the trade-off between draft model alignment with the target model and the draft model's size (which influences inference speed).
- The policy can be trained efficiently with a relatively small number of training examples.
- Incorporating an option for auto-regressive generation (using only the target model) allows the policy to bypass draft models when they are not beneficial, further improving overall speed (see the sketch after this list).
- The proposed approach generalizes to different LLM architectures (T5 and decoder-only Transformers) and self-drafting scenarios (using intermediate layers of the target model as draft models).
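At inference time, selection reduces to a single forward pass of the policy followed by a dispatch to either assisted or plain decoding. Below is a minimal sketch, assuming Hugging Face transformers models (whose generate method supports assisted decoding via the assistant_model argument), the policy from the previous sketch built with one extra arm reserved for the auto-regressive fallback, and a hypothetical embed_query helper. It is an illustration, not the paper's implementation.

```python
# A minimal sketch of context-aware draft selection at inference time.
import torch

@torch.no_grad()
def generate_with_selected_draft(policy, embed_query, tokenizer,
                                 target_model, draft_models, query, **gen_kwargs):
    """draft_models: list of candidate draft models; the last policy arm
    (index == len(draft_models)) is reserved for plain auto-regressive decoding."""
    inputs = tokenizer(query, return_tensors="pt").to(target_model.device)
    arm = policy(embed_query(query)).argmax(dim=-1).item()  # greedy arm choice
    if arm == len(draft_models):
        # "No draft" arm: no assistant is expected to help on this query.
        return target_model.generate(**inputs, **gen_kwargs)
    # Assisted (speculative) decoding: the selected draft model proposes
    # tokens that the target model verifies via accept-reject sampling.
    return target_model.generate(**inputs,
                                 assistant_model=draft_models[arm],
                                 **gen_kwargs)
```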
Main Conclusions:
This research demonstrates the effectiveness of context-aware draft model selection for accelerating LLM inference. By learning to choose the most suitable draft model based on the input query, significant speed-ups can be achieved without compromising generation quality.
Significance:
This work contributes to the growing field of efficient LLM inference by introducing a novel and practical approach for adaptive assisted decoding. It paves the way for deploying more powerful LLMs in real-world applications with limited computational resources.
Limitations and Future Research:
The paper primarily focuses on offline policy learning and using token-level similarity for reward estimation. Future research could explore online learning methods and more sophisticated alignment metrics. Additionally, integrating dynamic draft model selection with other adaptive speculative decoding techniques could further enhance inference efficiency.
Statistics
Using T5-Small as a draft model for a Flan-T5-XXL target on the XSUM dataset resulted in a 21% speedup compared to auto-regressive generation.
The same T5-Small draft model resulted in only a 3% speedup on the IWSLT2017 EN-DE translation task.
A T5-Small model fine-tuned on XSUM achieved a 17% slowdown on the IWSLT2017 EN-DE task compared to auto-regressive generation.
Training the policy on fewer than 10,000 examples was sufficient to achieve decoding speed comparable to training on the entire dataset of nearly 400,000 examples.
Quotes
"Rather than sampling directly from the larger model, the sampling is approximated with samples from a smaller and cheaper model through accept-reject sampling."
"By dynamically choosing between different draft models in any given scenario, then benefits from each draft model can be observed as long as the decision maker is competent and efficient."
"This highlights some immediate benefits of policy use, namely that it can identify the correct draft model for a context without any explicit information regarding the draft candidates themselves."