The paper presents a methodology for incorporating human rationales, i.e., text annotations that explain the reasoning behind human labeling decisions, into text classification models. The approach aims to enhance the plausibility of post-hoc explanations while preserving their faithfulness. The key highlights are:
The authors introduce a contrastive-inspired loss function that integrates rationales directly into model training. The loss requires neither changes to the model architecture nor assumptions about the specific explanation function used.
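As a rough illustration only (not the paper's exact formulation), a contrastive-style rationale loss can be sketched as a margin-based contrast between the importance a model assigns to tokens inside versus outside the human rationale. The tensor names (`scores`, `rationale_mask`, `pad_mask`), the hinge margin, and the weighting scheme below are assumptions made for the sketch.

```python
# Sketch of a contrastive-style auxiliary loss that pushes per-token importance
# scores to be higher on human-rationale tokens than on the remaining tokens.
# This is an illustrative assumption, not the paper's published loss.
import torch
import torch.nn.functional as F


def rationale_contrastive_loss(scores: torch.Tensor,
                               rationale_mask: torch.Tensor,
                               pad_mask: torch.Tensor,
                               margin: float = 0.1) -> torch.Tensor:
    """Margin-based contrast between rationale and non-rationale tokens.

    scores:         (batch, seq_len) per-token importance from the model,
                    e.g. attention weights or gradient-based saliency.
    rationale_mask: (batch, seq_len) 1 where a human marked the token, else 0.
    pad_mask:       (batch, seq_len) 1 for real tokens, 0 for padding.
    """
    rationale = (rationale_mask * pad_mask).float()
    other = ((1.0 - rationale_mask) * pad_mask).float()

    # Mean importance inside vs. outside the rationale (eps avoids div-by-zero).
    eps = 1e-8
    pos = (scores * rationale).sum(dim=1) / (rationale.sum(dim=1) + eps)
    neg = (scores * other).sum(dim=1) / (other.sum(dim=1) + eps)

    # Hinge: penalize examples whose rationale tokens are not scored at least
    # `margin` higher than the non-rationale tokens.
    return F.relu(margin - (pos - neg)).mean()


def total_loss(logits, labels, scores, rationale_mask, pad_mask, lam=0.5):
    """Task loss plus the rationale alignment term, weighted by `lam`."""
    task = F.cross_entropy(logits, labels)
    plaus = rationale_contrastive_loss(scores, rationale_mask, pad_mask)
    return task + lam * plaus
```

Because the auxiliary term only consumes importance scores and a rationale mask, a formulation like this leaves the classifier architecture untouched, which is the property the authors emphasize.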
The authors employ a multi-objective optimization framework to explore the trade-off between model performance and explanation plausibility. This allows them to generate a Pareto-optimal frontier of models that balance these two objectives.
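To make the trade-off exploration concrete, here is a minimal sketch under the assumption that each trade-off weight yields one trained model with a measured (accuracy, plausibility) pair; the `evaluate` callable and the weight grid are illustrative placeholders, not the paper's actual multi-objective procedure.

```python
# Sketch: sweep a performance/plausibility trade-off weight and keep only the
# Pareto-optimal (non-dominated) configurations. Both objectives are maximized.
from typing import Callable, List, Tuple


def pareto_front(points: List[Tuple[float, float, float]]
                 ) -> List[Tuple[float, float, float]]:
    """Return non-dominated (weight, accuracy, plausibility) triples."""
    front = []
    for p in points:
        dominated = any(
            q[1] >= p[1] and q[2] >= p[2] and (q[1] > p[1] or q[2] > p[2])
            for q in points
        )
        if not dominated:
            front.append(p)
    return sorted(front, key=lambda t: t[1])


def sweep(evaluate: Callable[[float], Tuple[float, float]],
          weights=(0.0, 0.1, 0.25, 0.5, 1.0, 2.0)):
    """Evaluate each trade-off weight and return the Pareto-optimal subset.

    `evaluate(w)` is assumed to train/score a model with weight `w` and return
    an (accuracy, plausibility) pair.
    """
    results = [(w, *evaluate(w)) for w in weights]
    return pareto_front(results)
```

Sweeping the weight and discarding dominated models yields a frontier from which one can pick a model with an acceptable balance of task performance and explanation plausibility.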
Through extensive experiments across diverse models, datasets, and explainability methods, the authors show that their approach substantially improves the quality of model explanations at the cost of only small, and in some cases negligible, degradation of the original model's performance.
The authors compare their methodology with a previous method from the literature, reinforcing the effectiveness of their approach in improving explanation plausibility while maintaining faithfulness.
The authors discuss the social and ethical implications of "teaching" explanations to text classification models, arguing that these concerns are mitigated when the explanations remain faithful to the model's decision-making process.
Source: Lucas E. Res..., arxiv.org, 04-05-2024, https://arxiv.org/pdf/2404.03098.pdf