This paper investigates how variations in prompt wording and formatting affect the predictions and performance of large language models (LLMs) across a range of text classification tasks.
The key findings are:
Prompt variations, even minor ones like adding a space or changing the output format, can change a significant proportion of an LLM's predictions, sometimes over 50%. This sensitivity is more pronounced in smaller models like Llama-7B compared to larger ones like Llama-70B and ChatGPT.
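To make this concrete, here is a minimal sketch of how prediction sensitivity could be measured, assuming per-sample predictions are available for a baseline prompt and a variant; the labels below are made-up placeholders, not data from the paper.

```python
# Sketch: quantify how often predictions flip between two prompt variants.
# The example labels are illustrative; the paper's predictions come from
# running each prompt variation over full classification benchmarks.

def prediction_change_rate(baseline_preds, variant_preds):
    """Fraction of samples whose predicted label differs between two prompt variants."""
    assert len(baseline_preds) == len(variant_preds)
    changed = sum(b != v for b, v in zip(baseline_preds, variant_preds))
    return changed / len(baseline_preds)

baseline   = ["pos", "neg", "neg", "pos", "neg"]   # predictions from the original prompt
with_space = ["pos", "pos", "neg", "pos", "pos"]   # predictions after e.g. adding a trailing space

print(f"{prediction_change_rate(baseline, with_space):.0%} of predictions changed")  # -> 40%
```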
While many prompt variations do not drastically impact overall accuracy, certain variations like jailbreaks can lead to substantial performance degradation. The AIM and Dev Mode v2 jailbreaks caused ChatGPT to refuse to respond in around 90% of cases.
Analyzing the similarity of the predictions across prompt variations using multidimensional scaling reveals interesting patterns. Variations that preserve the semantic meaning of the prompt, like adding greetings, tend to cluster together. In contrast, jailbreaks and formatting changes like using ChatGPT's JSON Checkbox feature stand out as outliers.
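As a rough illustration of this kind of analysis (not the authors' exact pipeline), the sketch below builds a pairwise disagreement matrix between prompt variations and embeds it in two dimensions with scikit-learn's MDS; the variation names and predictions are invented placeholders.

```python
# Sketch: embed prompt variations in 2D by how often their predictions disagree.
# Variation names and prediction vectors are illustrative placeholders.
import numpy as np
from sklearn.manifold import MDS

predictions = {
    "original":      [0, 1, 1, 0, 1, 0],
    "greeting":      [0, 1, 1, 0, 1, 0],
    "json_output":   [0, 1, 0, 0, 1, 1],
    "aim_jailbreak": [1, 0, 0, 1, 0, 1],
}

names = list(predictions)
n = len(names)
# Pairwise disagreement rate serves as the dissimilarity between variations.
dissim = np.zeros((n, n))
for i in range(n):
    for j in range(n):
        a, b = np.array(predictions[names[i]]), np.array(predictions[names[j]])
        dissim[i, j] = np.mean(a != b)

coords = MDS(n_components=2, dissimilarity="precomputed", random_state=0).fit_transform(dissim)
for name, (x, y) in zip(names, coords):
    print(f"{name:>14}: ({x:+.2f}, {y:+.2f})")
```

Variations with near-identical predictions (like the greeting variant) land close to the original, while the jailbreak ends up far away, mirroring the clustering behavior described above.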
The authors find only a slight negative correlation between annotator disagreement on a sample and the likelihood of that sample's prediction changing across prompt variations, suggesting that confusion on individual instances is not the sole driver of prediction changes.
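A hedged sketch of how such a relationship could be checked, assuming per-sample annotator disagreement scores and per-sample prediction-change rates are available; the arrays are hypothetical, and the use of Spearman correlation here is an assumption rather than necessarily the paper's statistic.

```python
# Sketch: correlate per-sample annotator disagreement with prediction instability.
# The arrays are hypothetical; the paper uses real annotations and predictions.
import numpy as np
from scipy.stats import spearmanr

# Per-sample annotator disagreement, e.g. 1 - (majority vote share).
annotator_disagreement = np.array([0.0, 0.1, 0.4, 0.0, 0.3, 0.5])
# Fraction of prompt variations under which each sample's prediction changed.
change_rate = np.array([0.6, 0.4, 0.3, 0.5, 0.2, 0.1])

rho, p_value = spearmanr(annotator_disagreement, change_rate)
print(f"Spearman rho = {rho:.2f} (p = {p_value:.3f})")
```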
Overall, this work highlights the need for robust and reliable prompt engineering when using LLMs, as even minor changes can have significant impacts on model behavior and performance.
Key Insights Extracted From
by Abel Salinas... at arxiv.org, 04-03-2024
https://arxiv.org/pdf/2401.03729.pdf