The authors argue that for the explanations produced by large language models to be informatively faithful, it is not enough to test whether they mention significant factors; they must mention significant factors more often than insignificant ones. Otherwise, an explainer that mentions every inserted word, impactful or not, would look faithful under a mention-only test.
The paper makes the following key contributions:
It introduces Correlational Explanatory Faithfulness (CEF), a novel faithfulness metric that improves upon prior work by capturing both the degree of impact of input features on model predictions and the difference in explanation mention frequency between impactful and non-impactful features.
It introduces the Correlational Counterfactual Test (CCT), which instantiates CEF on the Counterfactual Test (CT) from prior work, using statistical distance between predictions to measure impact; a minimal sketch of this computation follows the list below.
It runs experiments with the Llama 2 family of language models on three datasets and demonstrates that CCT captures faithfulness trends that the existing faithfulness metric used in CT misses.
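To make the metric concrete, here is a minimal sketch of how CCT could be computed. The `predict_proba`, `explain`, and `insert_word` callables are hypothetical stand-ins for the model under test, the "statistical distance" is instantiated here as total variation distance, and mention detection is simplified to a case-insensitive substring check; the paper's exact implementation may differ.

```python
import numpy as np
from scipy.stats import pearsonr


def total_variation_distance(p, q):
    """Total variation distance between two discrete label distributions."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return 0.5 * np.abs(p - q).sum()


def cct(examples, predict_proba, explain, insert_word):
    """Sketch of the Correlational Counterfactual Test (CCT).

    examples: iterable of (text, word) pairs, where `word` is the
        counterfactual token to insert into `text`.
    predict_proba: text -> label distribution (hypothetical model wrapper).
    explain: text -> free-text explanation string (hypothetical).
    insert_word: (text, word) -> perturbed text (hypothetical).
    """
    impacts, mentions = [], []
    for text, word in examples:
        perturbed = insert_word(text, word)
        # Impact: distance between the prediction distributions
        # before and after the counterfactual insertion.
        impacts.append(total_variation_distance(
            predict_proba(text), predict_proba(perturbed)))
        # Mention: does the explanation for the perturbed input refer
        # to the inserted word? (substring check as a crude proxy)
        mentions.append(float(word.lower() in explain(perturbed).lower()))
    # CEF: correlation between impact and mention frequency
    # (Pearson on a binary variable, i.e. point-biserial).
    r, _ = pearsonr(impacts, mentions)
    return r
```

Under this instantiation, a score near zero means the explainer mentions impactful and non-impactful insertions at similar rates, even if it mentions impactful ones often, which is exactly the failure mode a mention-only test like CT cannot distinguish.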
The authors find that model explanations are more likely to mention inserted words when those words are more impactful to the model's predictions, suggesting a degree of faithfulness that increases with model size. However, there is considerable variation across datasets, which could stem from the nature of the task or from the annotator-provided explanations.
Source: Noah Y. Sieg..., arxiv.org, 04-05-2024. https://arxiv.org/pdf/2404.03189.pdf