Core Concepts
Divide the dataset into subsets by confidence score, then apply a specialized Filter Choices based Reasoning (FCR) method to the low-confidence subset, achieving a better balance between cost and accuracy.
Abstract
The paper proposes a Divide and Conquer Reasoning (DCR) strategy to enhance the reasoning capabilities of large language models (LLMs) for multi-choice question (MCQ) answering.
Divide Stage:
- The dataset is first divided into two subsets, Dother and Dlow, based on a confidence score (CS) that reflects the model's certainty in its answers.
- CS is computed by analyzing the statistical frequency of the generated answers from multiple inference runs using Zero-Shot-CoT.
- Questions with CS above a threshold µ are categorized as Dother, while the rest are in Dlow.
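The divide stage can be sketched as follows, assuming CS is the empirical frequency of the majority answer across the Zero-Shot-CoT runs (the paper's exact CS definition may differ):

```python
from collections import Counter

def confidence_score(answers):
    """Fraction of sampled answers that agree with the majority answer.

    `answers` holds the answers produced by multiple Zero-Shot-CoT
    inference runs on a single question.
    """
    counts = Counter(answers)
    majority_count = counts.most_common(1)[0][1]
    return majority_count / len(answers)

def divide(dataset, mu):
    """Split questions into Dother (CS above threshold mu) and Dlow."""
    d_other, d_low = [], []
    for question, sampled_answers in dataset:
        cs = confidence_score(sampled_answers)
        (d_other if cs > mu else d_low).append((question, sampled_answers))
    return d_other, d_low
```

High-agreement questions land in Dother and keep their original answers; the rest go to Dlow for FCR.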
Conquer Stage:
- For the Dother subset, the original answers are used directly.
- For the Dlow subset, a Filter Choices based Reasoning (FCR) method is employed.
- FCR filters the original choice list using the answers generated in the divide stage, then re-queries the LLM with the reduced choice set.
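A minimal sketch of FCR, assuming the filtered list keeps only the options that appeared among the divide-stage answers; `ask_llm` is a hypothetical stand-in for the model query:

```python
def filter_choices(choices, sampled_answers):
    """Keep only the choices the model actually produced in the divide stage."""
    seen = set(sampled_answers)
    return [c for c in choices if c in seen]

def fcr(question, choices, sampled_answers, ask_llm):
    """Re-query the LLM on the reduced choice set for a low-confidence question."""
    reduced = filter_choices(choices, sampled_answers)
    # Fall back to the full choice list if filtering removes everything.
    if not reduced:
        reduced = choices
    return ask_llm(question, reduced)
```

Shrinking the choice set narrows the model's decision space, which is where the accuracy gain on Dlow comes from.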
The experiments show that DCR achieves an average accuracy improvement of 1.56% across 9 datasets, while consuming only 85% of the resources required by the state-of-the-art method. DCR is also evaluated on different LLMs and shown to be effective. Further analysis reveals the relationship between confidence score and accuracy, as well as the benefits of reducing the number of choices.
Stats
The average sample size (inference times) for each question across 9 datasets is 5.79 for DCR, compared to 6.79 for ESC and 8.94 for SC.
The average accuracy of DCR is 69.08%, surpassing SC and ESC by an average of 1.56%.
Quotes
"To escape this sky-high cost, ESC (Li et al., 2024) early-stops inference by calculating the entropy of answer distribution in a small sliding window without sacrificing SC's performance, which achieves SOTA currently."
"Motivated by these findings, we introduce Filter Choices based Reasoning (FCR), which excludes abundant options by using the answers from the divide stage, to conduct inference in conquer stage."
"Through extensive empirical evaluation across nine datasets including arithmetic, commonsense, and logic tasks, DCR not only consumes on average only 85% of resources required by ESC, but also improves accuracy by an average of 1.56% on these datasets."