Core Concepts
Increasing the number of LLM calls in a compound inference system can initially improve performance but later degrade it, because queries vary in difficulty.
Abstract
The study examines how the number of Large Language Model (LLM) calls affects the performance of compound inference systems. Theoretical and empirical analyses across language tasks reveal a non-monotonic relationship: adding calls improves performance at first but can degrade it later because queries differ in difficulty. Understanding this scaling behavior makes it possible to compute the number of LLM calls that maximizes accuracy.
The analysis focuses on one-layer Voting Inference Systems, which aggregate multiple LLM answers by majority vote. Despite their simplicity, these systems exhibit complex scaling properties driven by the diversity of query difficulty: additional calls help on easy queries, where each call is more likely right than wrong, but hurt on hard ones, where the opposite holds and majority voting amplifies errors. This insight yields an analytical scaling law that accurately predicts system performance and identifies the number of calls at which accuracy peaks.
Experiments on synthetic datasets with controlled item difficulties validate the theory: increasing the number of LLM calls does not always improve system performance, underscoring the need to account for query difficulty in system design. Experiments on real-world datasets further confirm that the scaling law can select the ensemble size without an exhaustive search.
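The non-monotonic behavior described above can be reproduced exactly for a toy one-layer voting system. The sketch below uses illustrative parameters not taken from the paper (per-call accuracy 0.8 on easy items, 0.4 on hard items, an even mix) and computes majority-vote accuracy in closed form:

```python
from math import comb

def majority_accuracy(p: float, k: int) -> float:
    """Exact probability that a strict majority of k independent calls,
    each correct with probability p, returns the right answer (k odd)."""
    need = k // 2 + 1
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(need, k + 1))

def system_accuracy(k: int, p_easy: float = 0.8, p_hard: float = 0.4,
                    frac_easy: float = 0.5) -> float:
    """Expected accuracy over a mix of easy and hard queries
    (assumed difficulty mix; illustrative, not from the paper)."""
    return (frac_easy * majority_accuracy(p_easy, k)
            + (1 - frac_easy) * majority_accuracy(p_hard, k))

for k in (1, 3, 5, 7, 21, 101):
    print(k, round(system_accuracy(k), 4))
```

With these parameters, accuracy rises from 0.6 at k = 1 to a peak near k = 5, then declines toward frac_easy = 0.5: easy items are eventually answered almost surely correctly, while hard items (per-call accuracy below 0.5) are almost surely answered incorrectly.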
Stats
Figure 1: How the number of calls to GPT-3.5 affects its performance on the MMLU college mathematics dataset [HBB+20] when aggregating results via majority vote.
Figure 2: Performance breakdown on easy and hard items as the number of LLM calls increases.
Table 1: Notations used in analyzing item difficulty's impact on Voting Inference Systems' performance.
Lemma 3: Expresses how item difficulty shapes system performance, stated in terms of incomplete beta functions.
Lemma 4: Gives the incomplete beta function computation used to predict system performance.
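The lemmas' use of incomplete beta functions presumably rests on the standard identity linking the binomial tail to the regularized incomplete beta function: for k calls with per-call accuracy p, the majority-vote success probability equals I_p(m, k - m + 1) with m = floor(k/2) + 1. A minimal stdlib-only check of this identity (the Simpson's-rule integration is an illustrative stand-in for a proper special-function routine such as scipy.special.betainc):

```python
from math import comb, gamma

def binom_tail(p: float, k: int, m: int) -> float:
    """P(at least m successes out of k trials), each succeeding with prob p."""
    return sum(comb(k, j) * p**j * (1 - p)**(k - j) for j in range(m, k + 1))

def reg_inc_beta(p: float, a: int, b: int, n: int = 20_000) -> float:
    """Regularized incomplete beta I_p(a, b) via composite Simpson's rule
    on [0, p]; accurate here since the integrand is polynomial for integer a, b."""
    f = lambda t: t ** (a - 1) * (1 - t) ** (b - 1)
    h = p / n
    s = (f(0.0) + f(p)
         + 4 * sum(f((2 * i - 1) * h) for i in range(1, n // 2 + 1))
         + 2 * sum(f(2 * i * h) for i in range(1, n // 2)))
    return (s * h / 3) / (gamma(a) * gamma(b) / gamma(a + b))

# Example: 5 calls, per-call accuracy 0.7, majority needs m = 3 correct.
k, p = 5, 0.7
m = k // 2 + 1
print(binom_tail(p, k, m), reg_inc_beta(p, m, k - m + 1))  # the two agree
```

This identity is what makes a closed-form scaling law tractable: the discrete sum over vote outcomes becomes a smooth function of p and k that can be analyzed and optimized directly.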
Quotes
"More LLM calls do not necessarily improve AI systems' performance."
"It is crucial to understand how compound systems scale with varying numbers of LLM calls."
"The study opens up avenues for effective AI system construction."