Core concept
Single-prompt evaluation of large language models leads to unstable and unreliable results. A multi-prompt evaluation approach is necessary to provide a more robust and meaningful assessment of model capabilities.
Summary
The authors argue that the common practice of evaluating large language models (LLMs) using a single instruction template per task is problematic, as it leads to unstable and unreliable results. They create a large-scale dataset of over 175 paraphrased instructions for 39 tasks across three benchmarks (LMENTRY, BIG-bench Lite, and BIG-bench Hard), involving 20 different LLMs.
The analysis reveals that different instruction templates can lead to vastly different absolute and relative model performance. For example, the ranking of models can change significantly depending on the chosen prompt. The authors also find that models can exhibit drastic performance differences even for minor changes in the wording of the instruction.
To address these limitations, the authors propose a set of multi-prompt evaluation metrics tailored for different use cases, such as measuring model robustness (average performance) or suitability for a specific downstream application (maximum performance). Evaluating the models using these metrics provides new insights into their strengths and limitations, which are not captured by the original single-prompt evaluations.
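The two aggregation metrics described above can be sketched in a few lines of code. The scores and model names below are invented for illustration and are not taken from the paper; they are chosen so that the two metrics disagree on which model is better, mirroring the paper's point that the right aggregation depends on the use case.

```python
# Toy sketch of multi-prompt aggregation metrics: average performance
# (robustness) vs. maximum performance (suitability for a downstream
# application). All scores and model names are hypothetical.

# Per-model accuracy under several paraphrased instruction templates.
scores = {
    "model_a": [0.72, 0.55, 0.68, 0.61],
    "model_b": [0.65, 0.64, 0.66, 0.63],
}

def average_performance(per_prompt_scores):
    """Robustness-oriented metric: mean accuracy across prompt paraphrases."""
    return sum(per_prompt_scores) / len(per_prompt_scores)

def max_performance(per_prompt_scores):
    """Downstream-oriented metric: best accuracy over prompt paraphrases."""
    return max(per_prompt_scores)

for name, s in scores.items():
    print(f"{name}: avg={average_performance(s):.3f}, max={max_performance(s):.2f}")
```

With these toy numbers, model_b has the higher average (it is more robust to paraphrasing), while model_a has the higher maximum (it would look better if one could tune the prompt for a specific application), so the model ranking flips depending on which metric the evaluator cares about.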
The authors also show that their automatic paraphrasing method is effective: the proposed metrics are largely unaffected by paraphrase quality, so manual verification of the paraphrases is unnecessary.
Finally, the authors perform a small-scale evaluation of OpenAI models, demonstrating that they are also sensitive to prompt paraphrasing, further highlighting the need for a multi-prompt evaluation approach.
Key findings
"Different instruction templates can lead to vastly different absolute and relative model performance."
"Models can exhibit drastic performance differences even for minor changes in the wording of the instruction."
"The ranking of models can change significantly depending on the chosen prompt."
Quotes
"We find that different instruction templates lead to very different performance, both absolute and relative."
"Instead, we propose a set of diverse metrics on multiple instruction paraphrases, specifically tailored for different use cases (e.g., LLM vs. downstream development), ensuring a more reliable and meaningful assessment of LLM capabilities."
"Our results suggest that future work should use multi-prompt LLM evaluations and choose a metric for aggregating the results according to the extrinsic needs of the evaluators."