The paper examines the robustness of large language model (LLM) evaluation to the distributional assumptions of benchmarks. The key findings are:
The correlation in model performance across test prompts within a benchmark is non-random, indicating inherent relationships between the prompts.
Accounting for these correlations can change model rankings on major benchmarks, with rank changes as large as 5 positions out of 14 models.
The similarity in model performance across prompts is not fully explained by the semantic similarity of the prompts; it appears to be driven more by failure points common to the LLMs (a sketch of this kind of analysis follows the findings).
Different weighting schemes for benchmark prompts, such as prioritizing diverse or representative prompts, can significantly shift aggregate performance and the relative ranking of models (a weighting sketch appears at the end of this summary).
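As a rough illustration of the correlation and similarity findings above, here is a minimal sketch under toy assumptions, not the authors' actual procedure: it builds a made-up models-by-prompts correctness matrix, checks whether the average prompt-to-prompt performance correlation exceeds a permutation baseline, and compares performance similarity against a stand-in semantic similarity computed from fabricated prompt embeddings. All data, dimensions, and variable names are illustrative.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Toy data (purely illustrative): rows = models, columns = benchmark prompts,
# entries = 1 if the model answered the prompt correctly, else 0.
n_models, n_prompts = 14, 200
scores = rng.integers(0, 2, size=(n_models, n_prompts)).astype(float)

# Fabricated prompt embeddings standing in for real sentence embeddings.
embeddings = rng.normal(size=(n_prompts, 32))


def upper_triangle(mat: np.ndarray) -> np.ndarray:
    """Unique off-diagonal entries of a square similarity matrix."""
    return mat[np.triu_indices_from(mat, k=1)]


# 1) Performance similarity between prompts: correlation of correctness patterns.
perf_corr = np.corrcoef(scores, rowvar=False)          # prompts x prompts
observed = np.nanmean(upper_triangle(perf_corr))

# Permutation baseline: shuffling each prompt's column independently preserves
# its difficulty but breaks alignment across models, approximating "random" correlation.
null_stats = []
for _ in range(500):
    shuffled = np.apply_along_axis(rng.permutation, 0, scores)
    null_stats.append(np.nanmean(upper_triangle(np.corrcoef(shuffled, rowvar=False))))
p_value = float(np.mean(np.array(null_stats) >= observed))

# 2) Semantic similarity between prompts (cosine similarity of embeddings),
#    compared against performance similarity with a rank correlation.
norm = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
sem_sim = norm @ norm.T
rho, _ = spearmanr(upper_triangle(perf_corr), upper_triangle(sem_sim), nan_policy="omit")

print(f"mean prompt-performance correlation: {observed:.3f} (permutation p={p_value:.3f})")
print(f"rank correlation between semantic and performance similarity: {rho:.3f}")
```

With real benchmark data in place of the random matrices, a high observed correlation relative to the permutation baseline together with a weak semantic-performance rank correlation would point toward shared failure modes rather than prompt wording as the source of the structure.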
The authors propose a novel approach for assessing the robustness and adequacy of benchmarks used to evaluate LLMs by analyzing the performance of multiple LLMs on a set of major benchmarks. This provides a framework for identifying and mitigating biases in benchmark design, which is crucial for fair and reliable comparisons of LLM performance.
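To make the weighting finding concrete, here is a second sketch, again on fabricated data. The simple correlation-based "diversity" weighting used below is an assumption of this note, not the paper's scheme; it merely shows how down-weighting redundant prompts can move models around in the ranking.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy correctness matrix: rows = models, columns = prompts (illustrative only).
n_models, n_prompts = 14, 200
scores = rng.integers(0, 2, size=(n_models, n_prompts)).astype(float)

# Prompt-prompt performance correlations (as in the sketch above).
corr = np.nan_to_num(np.corrcoef(scores, rowvar=False))

# "Diversity" weights: down-weight prompts whose performance pattern is highly
# correlated with many other prompts, so redundant prompts count less.
redundancy = np.abs(corr).sum(axis=1) - 1.0        # exclude self-correlation
div_weights = 1.0 / (1.0 + redundancy)
div_weights /= div_weights.sum()

uniform_weights = np.full(n_prompts, 1.0 / n_prompts)


def ranking(weights: np.ndarray) -> np.ndarray:
    """Model indices ordered best-to-worst under weighted accuracy."""
    weighted_acc = scores @ weights
    return np.argsort(-weighted_acc)


uniform_rank = ranking(uniform_weights)
diverse_rank = ranking(div_weights)

# How far each model moves between the two rankings.
pos_uniform = np.argsort(uniform_rank)
pos_diverse = np.argsort(diverse_rank)
print("max rank shift across models:", int(np.max(np.abs(pos_uniform - pos_diverse))))
```

Run on actual benchmark results, a comparison like this makes the sensitivity of the leaderboard to the implicit uniform-weighting assumption directly visible.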
Source: Melissa Aile..., arxiv.org, 04-29-2024: https://arxiv.org/pdf/2404.16966.pdf