The paper examines the robustness of large language model (LLM) evaluation to the distributional assumptions of benchmarks. The key findings are:
Correlations in model performance across test prompts within a benchmark are non-random, indicating inherent relationships between the prompts.
Accounting for these correlations can change model rankings on major benchmarks, with rank changes as large as 5 positions out of 14 models.
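The kind of prompt-level correlation described above can be sketched as follows: treat each prompt as a column of per-model scores and compute pairwise Pearson correlations between columns. The score matrix below is invented for illustration, not taken from the paper.

```python
# Hypothetical 0/1 score matrix: rows = models, columns = benchmark prompts.
# The values are illustrative only, not data from the paper.
scores = [
    [1, 1, 0, 1],
    [1, 1, 0, 0],
    [0, 0, 1, 1],
    [1, 1, 1, 0],
    [0, 0, 0, 1],
]

def pearson(x, y):
    """Pearson correlation between two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy) if sx and sy else 0.0

def prompt_correlations(scores):
    """Pairwise correlations between prompt columns across models."""
    cols = list(zip(*scores))
    m = len(cols)
    return {(i, j): pearson(cols[i], cols[j])
            for i in range(m) for j in range(i + 1, m)}

corrs = prompt_correlations(scores)
# Prompts 0 and 1 elicit identical pass/fail patterns here, so their
# correlation is 1.0: they contribute redundant signal to the benchmark.
```

A non-random concentration of high values in such a correlation matrix, relative to a shuffled baseline, is the kind of structure the paper reports.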
The similarity in model performance is only partly explained by semantic similarity between prompts; it appears to be driven more by common failure points of the LLMs.
Different weighting schemes for benchmark prompts, such as prioritizing diverse or representative prompts, can significantly impact the aggregate performance and relative ranking of models.
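A minimal sketch of how a weighting scheme can flip a ranking: suppose two prompts are near-duplicates (highly correlated) and a diversity-prioritizing scheme down-weights them. The model names, scores, and weights below are hypothetical, chosen only to illustrate the effect, not the paper's actual scheme.

```python
# Hypothetical per-prompt accuracies for two models on four prompts.
# Prompts 0 and 1 are assumed near-duplicates; 2 and 3 are distinctive.
model_scores = {
    "model_a": [0.5, 0.5, 0.9, 0.9],
    "model_b": [0.9, 0.9, 0.6, 0.6],
}

uniform = [0.25, 0.25, 0.25, 0.25]
# Diversity-prioritizing weights: down-weight the redundant pair,
# up-weight the distinctive prompts (illustrative values).
diversity = [0.125, 0.125, 0.375, 0.375]

def aggregate(scores, weights):
    """Weighted mean of per-prompt scores."""
    return sum(s * w for s, w in zip(scores, weights))

def ranking(weights):
    """Models sorted by weighted aggregate score, best first."""
    return sorted(model_scores,
                  key=lambda m: aggregate(model_scores[m], weights),
                  reverse=True)
```

Under uniform weights model_b leads (0.75 vs. 0.70), but once the redundant prompts are down-weighted model_a leads (0.80 vs. 0.675): the same score matrix yields different winners depending on the benchmark's implicit weighting.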
The authors propose a novel approach to assess the robustness and adequacy of benchmarks used in evaluating LLMs, by analyzing the performance of multiple LLMs on a set of major benchmarks. This provides a framework for identifying and mitigating biases in benchmark design, which is crucial for fair and reliable comparisons of LLM performance.
Source: arxiv.org