Examining Biases in Large Language Model Evaluation: The Impact of Distributional Assumptions in Benchmarks
Model performance is often correlated across the prompts within a single evaluation benchmark, and these non-random correlations can significantly change the relative ranking of large language models.
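A minimal sketch of the effect, under stated assumptions, is given below. The score table is synthetic and purely illustrative (it is not data from this work), the model names are hypothetical, and the greedy correlation-threshold grouping is one simple stand-in for whatever procedure is actually used to detect related prompts. The sketch groups prompts whose per-model correctness is highly correlated, then compares a uniform average over prompts with an average that gives each group equal weight.

```python
from math import sqrt

# Synthetic, illustrative data only (an assumption, not results from the paper).
# Rows are hypothetical models, columns are prompts; 1 = correct, 0 = incorrect.
# Prompts 0-3 are constructed to behave like near-duplicates of one another.
scores = {
    "model_a": [1, 1, 1, 1, 0, 0],
    "model_b": [0, 0, 0, 0, 1, 1],
    "model_c": [1, 1, 1, 1, 1, 0],
}


def pearson(x, y):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    vx = sum((a - mx) ** 2 for a in x)
    vy = sum((b - my) ** 2 for b in y)
    if vx == 0 or vy == 0:
        return 0.0
    return cov / sqrt(vx * vy)


def prompt_columns(table):
    """Transpose the model-by-prompt table into one list of outcomes per prompt."""
    return [list(col) for col in zip(*table.values())]


def cluster_prompts(table, threshold=0.9):
    """Greedy grouping: a prompt joins the first cluster whose first member
    it correlates with above `threshold`; otherwise it starts a new cluster."""
    cols = prompt_columns(table)
    clusters = []
    for j, col in enumerate(cols):
        for cluster in clusters:
            if pearson(cols[cluster[0]], col) > threshold:
                cluster.append(j)
                break
        else:
            clusters.append([j])
    return clusters


def uniform_score(row):
    """Plain accuracy: every prompt counts equally."""
    return sum(row) / len(row)


def cluster_weighted_score(row, clusters):
    """Average of per-cluster accuracies: every cluster counts equally."""
    per_cluster = [sum(row[j] for j in c) / len(c) for c in clusters]
    return sum(per_cluster) / len(per_cluster)


clusters = cluster_prompts(scores)
uniform = {m: uniform_score(r) for m, r in scores.items()}
reweighted = {m: cluster_weighted_score(r, clusters) for m, r in scores.items()}

print("clusters:  ", clusters)
print("uniform:   ", uniform)
print("reweighted:", reweighted)
```

In this toy table, model_a scores above model_b under uniform averaging, but the order reverses once the four correlated prompts are treated as a single group. With only three models the correlation estimates are very coarse, so the example shows the mechanism only and says nothing about the size of the effect on real benchmarks.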