Core Concepts
DYVAL introduces a dynamic evaluation protocol for large language models, arguing that dynamically generated test samples assess evolving model capabilities more accurately than static benchmarks.
Summary
Large language models (LLMs) face challenges with data contamination and the fixed complexity of existing benchmarks. DYVAL offers a flexible protocol for dynamic evaluation, generating diverse samples for reasoning tasks with controllable complexity. Results show that LLM performance degrades as sample complexity increases, highlighting the need for evolving evaluations. Failure analysis reveals a range of error patterns, suggesting room for improvement. Fine-tuning on DYVAL-generated data improves LLM performance on existing benchmarks.
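To make the generation idea concrete, here is a minimal, hypothetical sketch of how complexity-controlled evaluation samples for an arithmetic reasoning task could be produced on the fly. The function names and the `depth` parameter are illustrative assumptions for this sketch; the summary above does not specify DYVAL's actual generation algorithm.

```python
import operator
import random

# Illustrative sketch only: a fresh evaluation set is generated each time,
# and a single `depth` knob controls sample complexity. This is not the
# actual DYVAL generation algorithm, which is not detailed in this summary.

OPS = {"+": operator.add, "-": operator.sub, "*": operator.mul}


def generate_expression(depth: int, rng: random.Random):
    """Recursively build a random arithmetic expression of the given depth.

    Returns a (expression_string, ground_truth_value) pair.
    """
    if depth == 0:
        value = rng.randint(1, 9)
        return str(value), value
    op = rng.choice(list(OPS))
    left_expr, left_val = generate_expression(depth - 1, rng)
    right_expr, right_val = generate_expression(depth - 1, rng)
    return f"({left_expr} {op} {right_expr})", OPS[op](left_val, right_val)


def build_eval_set(num_samples: int, depth: int, seed: int = 0):
    """Produce a dynamically generated evaluation set; larger `depth` means harder samples."""
    rng = random.Random(seed)
    samples = []
    for _ in range(num_samples):
        expr, answer = generate_expression(depth, rng)
        samples.append({"question": f"Compute {expr}.", "answer": answer})
    return samples


if __name__ == "__main__":
    # Regenerating with a new seed yields unseen samples, avoiding contamination
    # from any fixed benchmark; increasing `depth` raises task difficulty.
    for sample in build_eval_set(num_samples=3, depth=3, seed=42):
        print(sample["question"], "->", sample["answer"])
```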
Statistics
Experiments show that LLMs perform worse on DYVAL-generated evaluation samples across different complexity levels.
GPT-4 performs best on most tasks, followed by GPT-3.5-Turbo.
Both GPT-4 and GPT-3.5-Turbo surpass human evaluators on most tasks.
Observed failure modes include partial calculation errors, incorrect reasoning, self-contradiction, unsubstantiated responses, and instruction oversights.
Quotes
"Results on DYVAL evaluation are not always consistent with those on existing benchmarks."
"As difficulty increases, LLMs tend to perform worse and their performance gap becomes larger."
"No prompt engineering methods can perform best in all of our evaluation sets."