The authors argue that the common practice of evaluating large language models (LLMs) with a single instruction template per task is problematic, as it leads to unstable and unreliable results. To study this, they build a large-scale collection of over 175 paraphrased instructions covering 39 tasks across three benchmarks (LMENTRY, BIG-bench Lite, and BIG-bench Hard) and evaluate 20 different LLMs on it.
The analysis reveals that different instruction templates can lead to substantially different results, both in models' absolute scores and in their relative ranking: which model comes out on top can change depending on the chosen prompt. The authors also find that even minor changes to the wording of an instruction can produce drastic performance differences.
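To make the ranking-instability point concrete, here is a minimal Python sketch (not the authors' code; the score matrix and model names are invented for illustration) that finds the best model under each instruction template and measures rank agreement between templates with Kendall's tau:

```python
import numpy as np
from scipy.stats import kendalltau

# Hypothetical accuracy matrix: rows = models, columns = instruction templates.
# The paper evaluates 20 models on many paraphrased templates per task; these
# numbers are invented purely for illustration.
scores = np.array([
    [0.62, 0.48, 0.71],   # model A
    [0.58, 0.66, 0.52],   # model B
    [0.60, 0.51, 0.69],   # model C
])
models = ["A", "B", "C"]

# Which model wins under each template? A different winner per column is the
# kind of ranking instability the paper reports.
for t in range(scores.shape[1]):
    print(f"template {t}: best model = {models[int(scores[:, t].argmax())]}")

# Pairwise rank agreement between templates: low Kendall's tau means the
# choice of template reshuffles the model ranking.
for i in range(scores.shape[1]):
    for j in range(i + 1, scores.shape[1]):
        tau, _ = kendalltau(scores[:, i], scores[:, j])
        print(f"templates {i} vs {j}: Kendall's tau = {tau:.2f}")
```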
To address these limitations, the authors propose a set of multi-prompt evaluation metrics tailored to different use cases, such as measuring model robustness (average performance across paraphrases) or suitability for a specific downstream application (maximum performance over paraphrases). Evaluating the models with these metrics yields new insights into their strengths and limitations that the original single-prompt evaluations do not capture.
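The two aggregate metrics mentioned above can be sketched as follows, assuming a vector of per-prompt accuracies for one model on one task (the function name and numbers are hypothetical, not taken from the paper):

```python
import numpy as np

def multi_prompt_metrics(per_prompt_scores: np.ndarray) -> dict:
    """Aggregate a model's per-prompt scores on a single task.

    Averaging over paraphrased instructions estimates robustness to prompt
    wording; taking the maximum estimates best-case performance when the
    prompt can be tuned for a specific downstream application.
    """
    return {
        "average_performance": float(per_prompt_scores.mean()),  # robustness view
        "maximum_performance": float(per_prompt_scores.max()),   # best-prompt view
    }

# Example with invented per-prompt accuracies for one model on one task.
per_prompt_accuracy = np.array([0.55, 0.61, 0.47, 0.68, 0.59])
print(multi_prompt_metrics(per_prompt_accuracy))
```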
The authors also show that their automatic paraphrasing method is effective: the proposed metrics are largely insensitive to paraphrase quality, so manual verification of the paraphrases is not needed.
Finally, the authors perform a small-scale evaluation of OpenAI models, demonstrating that these models are also sensitive to prompt paraphrasing and further underscoring the need for a multi-prompt evaluation approach.
by Moran Mizrah... at arxiv.org, 05-07-2024
https://arxiv.org/pdf/2401.00595.pdf