Core Concepts
Larger language models, despite their superior standalone performance, do not always generate the best responses for instruction-tuning smaller language models, a phenomenon dubbed the "Larger Models' Paradox." The paradox highlights the importance of compatibility between the response generator and the base model being fine-tuned.
Stats
Gemma-2-9b-it and Qwen2.5-72B-Instruct emerged as the two best response generators based on average performance across benchmarks.
Open-source LLMs, including Gemma 2, Llama 3, Llama 3.1, and Qwen2.5, significantly outperformed GPT-4 as response generators.
Higher temperature and top-p values during response generation led to stronger instruction-following in the fine-tuned models (see the sampling sketch after this list).
Rejection sampling with a reward model yielded only a slight improvement in instruction-tuning performance over standard sampling (a best-of-N sketch follows this list).
Spearman's rank correlation coefficient (ρ) showed that the proposed Compatibility-Adjusted Reward (CAR) metric outperformed baseline metrics (quality, difficulty, response length) in predicting the effectiveness of response generators (a correlation sketch closes this section).
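A minimal sketch of response generation with higher temperature and top-p, using Hugging Face transformers. The model name comes from the findings above; the exact sampling values, prompt, and generation length are illustrative assumptions, not the study's configuration.

```python
# Sketch: sampling instruction-tuning responses with higher temperature / top-p.
# Sampling values here are illustrative, not the study's exact settings.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "google/gemma-2-9b-it"  # one of the strongest response generators above
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

instruction = "Explain the difference between supervised and unsupervised learning."
inputs = tokenizer.apply_chat_template(
    [{"role": "user", "content": instruction}],
    add_generation_prompt=True,
    return_tensors="pt",
).to(model.device)

# Higher temperature and top-p produce more diverse responses, which the
# findings above associate with better downstream instruction-following.
output = model.generate(
    inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,  # "higher" relative to greedy or low-temperature decoding
    top_p=0.95,       # nucleus-sampling cutoff; illustrative value
)
response = tokenizer.decode(output[0, inputs.shape[1]:], skip_special_tokens=True)
print(response)
```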
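A sketch of rejection (best-of-N) sampling with a reward model: draw several candidate responses per instruction, score each, and keep the highest-scoring one. The `generate_response` and `score` callables are hypothetical stand-ins; the study's actual generator and reward model are not specified here.

```python
# Sketch of best-of-N rejection sampling with a reward model.
from typing import Callable, List

def rejection_sample(
    instruction: str,
    generate_response: Callable[[str], str],  # samples one response (e.g. the snippet above)
    score: Callable[[str, str], float],       # reward model: (instruction, response) -> scalar
    n_samples: int = 8,
) -> str:
    # Draw several independent candidates for the same instruction.
    candidates: List[str] = [generate_response(instruction) for _ in range(n_samples)]
    rewards = [score(instruction, c) for c in candidates]
    # Keep the candidate the reward model prefers; per the findings above,
    # this yields only a slight gain over taking a single standard sample.
    best = max(range(n_samples), key=lambda i: rewards[i])
    return candidates[best]
```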
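A sketch of the rank-correlation comparison, assuming `scipy` is available: compute Spearman's ρ between each metric's scores for a set of response generators and the instruction-following performance those generators actually produced. Every numeric value below is invented for illustration and does not reproduce the study's data.

```python
# Sketch: how well does each metric rank response generators? Measured as
# Spearman's rho against observed fine-tuned performance. Data is made up.
from scipy.stats import spearmanr

# Hypothetical scores for five response generators under each metric, plus
# the downstream instruction-following performance they produced.
downstream_perf = [42.1, 39.5, 37.8, 35.0, 33.2]
metric_scores = {
    "CAR":             [0.91, 0.84, 0.80, 0.71, 0.65],
    "quality":         [0.88, 0.90, 0.79, 0.82, 0.70],
    "difficulty":      [0.50, 0.62, 0.58, 0.49, 0.55],
    "response_length": [812, 1030, 640, 905, 700],
}

for name, scores in metric_scores.items():
    rho, pvalue = spearmanr(scores, downstream_perf)
    print(f"{name:16s} rho={rho:+.2f} (p={pvalue:.3f})")
# A higher rho means the metric's ranking of generators better matches the
# ranking by actual fine-tuned performance; the finding above is that CAR
# achieves the highest correlation among these metrics.
```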