Core Concepts
Leveraging a lightweight language model to guide a large language model in reasoning tasks can improve the quality of generated rationales and enhance overall task performance.
Abstract
The paper introduces a novel framework called LM-Guided Chain-of-Thought (CoT) that uses two independent language models (LMs): a small LM for rationale generation and a large LM for answer prediction.
The key steps are:
Rationale Distillation: The small LM is first trained via knowledge distillation to imitate reasoning rationales generated by the large LM (a minimal sketch follows this list).
Rationale Refinement: The small LM's rationales are then further optimized with reinforcement learning, using a reward derived from 8 rationale quality aspects (factuality, relevance, logicality, consistency, coherence, fluency, naturalness, readability); see the reward sketch after the list.
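As a rough illustration of the distillation step, the sketch below fine-tunes a small sequence-to-sequence student LM on rationales sampled from the large LM. The model choice (flan-t5-small), the data format, and the hyperparameters are assumptions for illustration, not the authors' exact setup.

```python
# Sketch: distill large-LM rationales into a small seq2seq student LM.
# The model (flan-t5-small), data format, and hyperparameters are assumptions.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

tokenizer = AutoTokenizer.from_pretrained("google/flan-t5-small")
student = AutoModelForSeq2SeqLM.from_pretrained("google/flan-t5-small")
optimizer = torch.optim.AdamW(student.parameters(), lr=5e-5)

# Each example pairs a question (with context) with a rationale that the
# large LM produced via CoT prompting; the student learns to reproduce it.
distill_examples = [
    {"input": "Question: Who lived longer, A or B? Context: ...",
     "rationale": "A was born in 1900 and died in 1990, ... so A lived longer."},
]

student.train()
for ex in distill_examples:
    enc = tokenizer(ex["input"], return_tensors="pt", truncation=True)
    labels = tokenizer(ex["rationale"], return_tensors="pt", truncation=True).input_ids
    loss = student(**enc, labels=labels).loss  # cross-entropy on rationale tokens
    loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```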
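The refinement reward can be viewed as an aggregation of per-aspect scores. The sketch below simply averages the eight aspects into a single scalar; the individual scorers and the equal weighting are assumptions, and the resulting reward would drive a policy-gradient update (e.g., PPO) of the small LM.

```python
# Sketch: aggregate the eight rationale quality aspects into a scalar reward.
# The per-aspect scorers and the equal weighting are assumptions; the scalar
# would feed a policy-gradient method such as PPO.
ASPECTS = ["factuality", "relevance", "logicality", "consistency",
           "coherence", "fluency", "naturalness", "readability"]

def rationale_reward(aspect_scores: dict[str, float]) -> float:
    """Average per-aspect scores (each assumed in [0, 1]) into one reward."""
    return sum(aspect_scores[a] for a in ASPECTS) / len(ASPECTS)

# Example: a rationale scored on all eight aspects.
print(rationale_reward({a: 0.8 for a in ASPECTS}))  # -> 0.8
```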
The authors conduct experiments on multi-hop question answering tasks using the HotpotQA and 2WikiMultiHopQA datasets. The results show that LM-Guided CoT outperforms both standard prompting and the original CoT prompting, particularly in answer prediction accuracy and rationale quality. The reinforcement learning step contributes slight additional improvements in both rationale quality and task performance.
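For context, inference under LM-Guided CoT decomposes into two calls: the small LM writes the rationale, and the large LM answers conditioned on it. A minimal sketch, where `small_lm_generate` and `large_lm_generate` are hypothetical wrappers around the two models rather than a specific library API:

```python
# Sketch: LM-Guided CoT inference. `small_lm_generate` and `large_lm_generate`
# are hypothetical wrappers around the rationale LM and the frozen answer LM,
# not a specific library API.
def lm_guided_cot_answer(question: str, context: str,
                         small_lm_generate, large_lm_generate) -> str:
    # Step 1: the small LM produces the chain-of-thought rationale.
    rationale = small_lm_generate(
        f"Generate a step-by-step rationale.\nContext: {context}\nQuestion: {question}"
    )
    # Step 2: the large LM predicts the answer conditioned on that rationale.
    answer = large_lm_generate(
        f"Context: {context}\nQuestion: {question}\n"
        f"Rationale: {rationale}\nAnswer:"
    )
    return answer.strip()
```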
The authors also find that selecting only the highest-quality rationales does not always guarantee improved task performance, highlighting the need to balance rationale quality against the overall task objective.
Stats
The authors report the following key figures (EM = exact match):
The standard prompting approach achieves EM scores of 0.5 and 0.5 on HotpotQA and 2WikiMultiHopQA, respectively.
The original CoT prompting approach achieves EM scores of 0.483 and 0.4 on HotpotQA and 2WikiMultiHopQA, respectively.
LM-Guided CoT prompting with knowledge distillation (KD) achieves EM scores of 0.507 and 0.506 on HotpotQA and 2WikiMultiHopQA, respectively.
LM-Guided CoT prompting with KD and self-consistency (SC) decoding achieves the highest EM scores, 0.513 and 0.524 on HotpotQA and 2WikiMultiHopQA, respectively (a sketch of SC decoding follows this list).
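Self-consistency decoding here means sampling several rationale-answer pairs and keeping the majority answer. A minimal sketch, with `generate_answer` as a hypothetical stochastic sampling wrapper:

```python
# Sketch: self-consistency (SC) decoding as majority voting over sampled
# answers; `generate_answer` is a hypothetical stochastic sampling wrapper.
from collections import Counter

def self_consistency_answer(question: str, generate_answer, n_samples: int = 10) -> str:
    """Sample n answers (each from a freshly sampled rationale) and
    return the most frequent one."""
    answers = [generate_answer(question) for _ in range(n_samples)]
    return Counter(answers).most_common(1)[0][0]
```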
Quotes
"LM-guided CoT prompting outperforms both the standard prompting and the original CoT prompting."
"We find that (1) LM-guided CoT with KD and self-consistency (SC) decoding strategy maximizes the performance gain; (2) RL contributes to a slight increase in overall rationale quality and task performance; (3) choosing the highest-quality rationales for the large LM does not always guarantee improved task performance."