Core Concepts
TREACLE, a reinforcement learning-based framework, dynamically selects the optimal large language model (LLM) and prompting scheme to answer questions while respecting user-defined cost and latency constraints.
Abstract
The paper proposes TREACLE, a framework that addresses the challenge of efficiently using the growing number of large language models (LLMs) with varying capabilities and costs. TREACLE uses a reinforcement learning policy to dynamically select the appropriate LLM and prompting scheme for each question, in order to maximize accuracy while respecting the user's monetary budget and latency constraints.
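One way to read the goal stated above is as a budget-constrained optimization. The formulation below is only a sketch under assumed notation, not the paper's own: $\pi$ is the selection policy, $N$ the number of questions, $y_i$ the correct answer and $\hat{y}_i$ the final answer returned for question $i$, $c_i$ the total cost spent on question $i$ (including any re-queries), and $B$ the user's budget. How latency enters the cost term is left open here.

```latex
\max_{\pi} \; \mathbb{E}\left[ \sum_{i=1}^{N} \mathbf{1}\{\hat{y}_i = y_i\} \right]
\quad \text{subject to} \quad \sum_{i=1}^{N} c_i \le B
```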
The key insights are:
- Characterization of the accuracy, monetary cost, and latency trade-offs across different LLM and prompting combinations. The characterization shows that more powerful LLMs and more sophisticated prompts (e.g., chain-of-thought) can significantly improve accuracy, but at a higher monetary cost.
- TREACLE's reinforcement learning policy leverages the context of the current question (e.g., text embedding, response history) to make smart decisions about which LLM and prompt to use. It can also re-query models to verify the consistency of responses.
- Extensive evaluations show that TREACLE can achieve up to 85% cost savings compared to using individual LLMs, while maintaining high accuracy on challenging reasoning tasks. It is also robust to changes in LLM availability, pricing, and question difficulty.
Overall, TREACLE provides a unified framework for efficiently navigating the rich design space of LLM capabilities, costs, and prompting schemes to meet the user's accuracy, cost, and latency requirements.
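The selection loop described above (pick a model and prompt, re-query to check consistency, respect the budget) can be illustrated with a small sketch. This is not TREACLE's learned policy: it replaces the reinforcement learning policy with a fixed hand-written heuristic that escalates to a more expensive option whenever repeated answers disagree. The names `Option`, `answer_with_budget`, `ask`, and `votes_needed` are hypothetical and introduced only for illustration.

```python
from collections import Counter
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass(frozen=True)
class Option:
    """One (model, prompt) combination. All fields are hypothetical."""
    model: str
    prompt: str
    cost: float                 # assumed per-query cost
    ask: Callable[[str], str]   # assumed callable returning an answer string


def answer_with_budget(
    question: str,
    options: List[Option],      # ordered from cheapest to most expensive
    budget: float,
    votes_needed: int = 2,
) -> Tuple[Optional[str], float, List[Tuple[str, str, str]]]:
    """Heuristic stand-in for a learned policy.

    Re-queries the current option until `votes_needed` responses agree,
    escalates to the next option when they disagree, and stops as soon
    as the next query would exceed the budget.
    """
    history: List[Tuple[str, str, str]] = []
    spent = 0.0
    level = 0
    while level < len(options):
        opt = options[level]
        if spent + opt.cost > budget:
            break  # the next query is unaffordable
        answer = opt.ask(question)
        spent += opt.cost
        history.append((opt.model, opt.prompt, answer))

        # Response history for the current option only.
        same_level = [a for m, p, a in history if (m, p) == (opt.model, opt.prompt)]
        top, count = Counter(same_level).most_common(1)[0]
        if count >= votes_needed:
            return top, spent, history   # consistent answers: accept
        if len(same_level) >= votes_needed:
            level += 1                   # inconsistent answers: escalate

    if not history:
        return None, spent, history      # nothing was affordable
    # Budget or options exhausted: fall back to the overall majority answer.
    fallback = Counter(a for _, _, a in history).most_common(1)[0][0]
    return fallback, spent, history
```

A learned policy would replace the fixed escalation rule with a decision conditioned on the question embedding, the response history, and the remaining budget; the sketch only shows where those inputs would enter.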
Stats
The average cost (in log $) and average latency (in log 1/50000 s) for different model-prompt combinations on the GSM8K test dataset, listed as (cost, latency) pairs, are:
Llama-2-7b CoT: (4.0, 14.0)
Llama-2-13b CoT: (4.3, 14.5)
GPT-3.5 Domain expert: (2.7, 15.5)
GPT-3.5 CoT: (3.0, 15.5)
GPT-4 Domain expert: (3.0, 16.0)
GPT-4 CoT: (3.4, 16.0)
Quotes
"TREACLE uses the problem context, including question text embeddings (reflecting the type or difficulty of a query) and the response history (reflecting the consistency of previous responses) to make smart decisions."
"TREACLE achieves the Pareto front of individual LLMs by combining them intelligently."