The content introduces EURUS, a suite of large language models (LLMs) optimized for reasoning. EURUS models are fine-tuned from Mistral-7B and CodeLlama-70B and achieve state-of-the-art results on a diverse set of benchmarks covering mathematics, code generation, and logical reasoning problems.
The key to EURUS's strong performance is ULTRAINTERACT, a newly-curated large-scale, high-quality alignment dataset designed specifically for complex reasoning tasks. For each instruction, ULTRAINTERACT includes a preference tree consisting of (1) reasoning chains with diverse planning strategies in a unified format, (2) multi-turn interaction trajectories with the environment and with critique, and (3) paired correct and incorrect actions to facilitate preference learning.
ULTRAINTERACT can be used in both supervised fine-tuning and preference learning. Experiments show that instruction fine-tuning on ULTRAINTERACT alongside established datasets already yields strong performance. ULTRAINTERACT further facilitates preference learning for reasoning tasks, improving performance further with KTO and NCA. Surprisingly, DPO hurts performance, and the content analyzes this finding in depth: DPO optimizes only the relative margin between chosen and rejected responses, which can drive down the absolute likelihood of the chosen (correct) responses, whereas KTO and NCA also push absolute rewards in the right direction.
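The contrast between margin-only and absolute-reward objectives can be illustrated with a simplified scalar sketch. This is not the paper's implementation: the function names are hypothetical, the inputs are single per-example log-probability ratios (policy vs. reference model), and the KTO reference point `ref_kl` is fixed at 0 rather than estimated from a batch as in the full KTO formulation.

```python
import math

def _sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dpo_loss(chosen_logratio, rejected_logratio, beta=1.0):
    # DPO: -log(sigmoid(margin)); depends only on the *relative*
    # margin between chosen and rejected log-ratios.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -math.log(_sigmoid(margin))

def kto_loss(chosen_logratio, rejected_logratio, beta=1.0, ref_kl=0.0):
    # Simplified KTO-style loss: chosen and rejected examples are
    # scored separately against a reference point, so the *absolute*
    # implicit reward of each response matters, not just the margin.
    loss_chosen = 1.0 - _sigmoid(beta * (chosen_logratio - ref_kl))
    loss_rejected = 1.0 - _sigmoid(beta * (ref_kl - rejected_logratio))
    return 0.5 * (loss_chosen + loss_rejected)

# Two hypothetical training states with the SAME margin (2.0), but in
# the second the chosen response's likelihood has dropped below the
# reference model's. DPO cannot tell them apart; the KTO-style loss
# penalizes the second state.
healthy = dpo_loss(1.0, -1.0), kto_loss(1.0, -1.0)
degraded = dpo_loss(-2.0, -4.0), kto_loss(-2.0, -4.0)
print(healthy, degraded)
```

Under this sketch, `dpo_loss` is identical for both states while `kto_loss` is markedly higher for the degraded one, mirroring the reported failure mode of DPO on reasoning tasks.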
The content also introduces EURUS-RM-7B, a reward model trained on ULTRAINTERACT that demonstrates especially strong preference modeling performance on reasoning tasks, outperforming even GPT-4 on certain benchmarks.
Key insights distilled from a paper by Lifan Yuan, G... at arxiv.org, 04-03-2024: https://arxiv.org/pdf/2404.02078.pdf