The paper introduces proxy-tuning, a lightweight decoding-time algorithm that efficiently customizes large pretrained language models without accessing their internal weights. The key idea is to pair a small tuned "expert" model with its untuned "anti-expert" counterpart: at each decoding step, the difference between their predictions is applied to the larger base model's output logits, steering the base model as if it had been tuned.
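A minimal sketch of that combination step is shown below, assuming the three models' logits are already aligned over a shared vocabulary; the helper name `proxy_tuned_logits` and the toy numbers are illustrative, not taken from the paper.

```python
import torch
import torch.nn.functional as F

def proxy_tuned_logits(base_logits, expert_logits, antiexpert_logits):
    """One decoding step of proxy-tuning.

    The large base model's logits are shifted by the difference between a
    small tuned expert and its untuned anti-expert, so the base model
    inherits the effect of tuning without any change to its weights.
    """
    return base_logits + (expert_logits - antiexpert_logits)

# Toy 5-token vocabulary (illustrative numbers only).
base       = torch.tensor([2.0, 1.0, 0.5, 0.0, -1.0])  # large untuned model
expert     = torch.tensor([1.5, 2.5, 0.0, 0.0, -1.0])  # small tuned model
antiexpert = torch.tensor([1.5, 1.0, 0.0, 0.0, -1.0])  # small untuned model

probs = F.softmax(proxy_tuned_logits(base, expert, antiexpert), dim=-1)
print(probs, probs.argmax().item())
```

Adding logit differences is equivalent to multiplying the base model's probabilities by the expert/anti-expert probability ratio and renormalizing.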
The authors first evaluate proxy-tuning on instruction following, using a small 7B-parameter CHAT model as the expert to steer the larger 13B and 70B LLAMA2 base models. Proxy-tuning closes 91.1% and 88.1% of the performance gap between the base models and their directly tuned CHAT counterparts, respectively, across benchmarks covering knowledge, reasoning, and safety.
The authors then demonstrate the generality of proxy-tuning by applying it to domain adaptation on code and to task-specific finetuning on question answering and math problems. For code adaptation, proxy-tuning the 13B and 70B base models yields 17-32% and 6-8% absolute improvements, respectively, over the untuned base models. For task finetuning, proxy-tuning the 70B model achieves a 31% absolute improvement on average over the untuned 70B, and a 9% improvement over the tuned 7B task model.
The paper also analyzes how proxy-tuning influences the token-level probability distribution, finding that it has the largest impact on promoting reasoning and stylistic tokens. Additionally, the authors show how a hyperparameter can be introduced to the proxy-tuning formula to provide more granular control over the strength of the tuning.
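A sketch of that control knob follows, under the assumption that the hyperparameter simply scales the expert/anti-expert logit offset; the function name and the symbol `alpha` are illustrative.

```python
import torch

def proxy_tuned_logits_scaled(base_logits, expert_logits, antiexpert_logits, alpha=1.0):
    """Proxy-tuning with an explicit steering strength.

    alpha = 0 leaves the base model untouched, alpha = 1 recovers plain
    proxy-tuning, and larger values push the output further toward the
    tuned expert's behavior.
    """
    return base_logits + alpha * (expert_logits - antiexpert_logits)

# Halve the steering strength relative to plain proxy-tuning.
combined = proxy_tuned_logits_scaled(
    torch.tensor([2.0, 1.0, 0.5]),
    torch.tensor([1.5, 2.5, 0.0]),
    torch.tensor([1.5, 1.0, 0.0]),
    alpha=0.5,
)
print(combined)
```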
Finally, the authors present a case study of applying proxy-tuning to the truly black-box GPT-3.5 model, in an extremely limited-information setting where only the top 5 log probabilities are available. Using LLAMA2-7B as the expert, they are able to proxy-tune GPT-3.5 for temporal adaptation, improving its accuracy on questions about recent events.
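A rough sketch of decoding in that setting is given below, assuming the black-box API exposes a {token: logprob} dictionary for its top candidates and that the small models' log probabilities can be looked up for those same candidate strings (aligning the two tokenizers' vocabularies is a real complication this sketch glosses over; the helper name and toy values are hypothetical, not from the paper).

```python
import math

def proxy_tune_topk(blackbox_top_logprobs, expert_logprob, antiexpert_logprob):
    """Re-rank a black-box model's visible candidates with a small expert pair.

    blackbox_top_logprobs: {token: logprob} for the few tokens the API exposes.
    expert_logprob / antiexpert_logprob: callables returning the small tuned
    and untuned models' log probabilities for a candidate token in the same
    context. Tokens outside the exposed top-k cannot be recovered here.
    """
    scores = {
        tok: lp + expert_logprob(tok) - antiexpert_logprob(tok)
        for tok, lp in blackbox_top_logprobs.items()
    }
    # Renormalize over the visible candidates and pick the highest-scoring one.
    log_z = math.log(sum(math.exp(s) for s in scores.values()))
    best = max(scores, key=scores.get)
    return best, {tok: s - log_z for tok, s in scores.items()}

# Toy example: the small expert prefers the more recent answer (made-up values).
top_candidates = {"2021": -0.2, "2023": -1.4, "2019": -2.5}
expert_lp = lambda t: {"2023": -0.4, "2021": -1.8, "2019": -2.5}[t]
antiexpert_lp = lambda t: {"2023": -2.0, "2021": -0.6, "2019": -2.5}[t]
print(proxy_tune_topk(top_candidates, expert_lp, antiexpert_lp))
```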
Overall, the paper demonstrates the promise of proxy-tuning as an efficient and effective approach for customizing large language models to diverse user needs and applications, without requiring access to their internal weights.