Core Concepts
Evaluating the reasoning process of large language models (LLMs) through follow-up questions, and incorporating this evaluation into self-training, enhances the models' reasoning abilities and robustness.
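A minimal sketch of this follow-up evaluation idea, assuming a rationale is scored by how many follow-up questions a model answers consistently when conditioned on it; the function name, the (question, expected answer) pair format, and the injected `answer_fn` are illustrative assumptions, not the paper's implementation:

```python
from typing import Callable, List, Tuple


def followup_consistency(
    rationale: str,
    followups: List[Tuple[str, str]],        # (follow-up question, expected answer) pairs
    answer_fn: Callable[[str, str], str],    # model call: (rationale, question) -> answer
) -> int:
    """Count how many follow-up questions the model answers consistently
    when its answer is conditioned on the given rationale."""
    consistent = 0
    for question, expected in followups:
        prediction = answer_fn(rationale, question)
        if prediction.strip().lower() == expected.strip().lower():
            consistent += 1
    return consistent
```

Rationales with higher consistency counts would then be treated as more robust when building the self-training data.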
Stats
CREST achieves higher accuracy than other self-training baselines on the ReClor, ARC, and CSQA datasets.
On ARC-Challenge and CSQA, performance peaks at a tolerance value of t = 2, showing the benefit of excluding less robust rationales from training (a filtering sketch follows these stats).
On ReClor, the optimum shifts to t = 3, suggesting that the best tolerance depends on the reasoning complexity of the dataset.
Preference learning with λ = 0.6, balancing the trade-off between the easy and hard rationale sets on ReClor, yields the best overall performance (see the sketch below).
CREST improves rationale generation across all three FLASK metrics (robustness, correctness, and efficiency), as evaluated by GPT-4o.
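A minimal sketch of how the tolerance t and the mixing weight λ could be applied, assuming rationales are filtered by their follow-up failure count and the easy/hard preference losses are combined as a convex combination; the field name, loss form, and default values are assumptions, not the paper's exact formulation:

```python
from typing import Dict, List


def filter_by_tolerance(rationales: List[Dict], t: int = 2) -> List[Dict]:
    """Keep only rationales that fail at most t follow-up questions
    (hypothetical field name 'num_failed_followups')."""
    return [r for r in rationales if r["num_failed_followups"] <= t]


def mixed_preference_loss(loss_easy: float, loss_hard: float, lam: float = 0.6) -> float:
    """Assumed form of the easy/hard trade-off: a convex combination of the
    preference losses computed on the easy and hard rationale sets."""
    return lam * loss_easy + (1.0 - lam) * loss_hard
```

Under this reading, raising t admits more (but less robust) rationales into training, and λ shifts the preference-learning signal between the easy and hard sets.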