The paper investigates the explanation robustness of rationalization models, which select a subset of the input text as a rationale to explain the model's prediction. The authors propose UAT2E, a variant of Universal Adversarial Triggers, to conduct both non-target and target attacks on the explanations of these models.
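For context, rationalization models typically follow a select-then-predict design: a selector produces a token-level mask, and a predictor classifies only the selected tokens. The PyTorch module below is a minimal sketch of that setup, assuming GRU encoders, a 0.5 selection threshold, and a straight-through estimator; it illustrates the general architecture rather than the specific models evaluated in the paper.

```python
# A minimal select-then-predict rationalization model (generic sketch, not the
# exact architectures studied in the paper). The selector scores each token,
# a hard mask is applied, and the predictor classifies only the kept tokens.
import torch
import torch.nn as nn


class RationalizationModel(nn.Module):
    def __init__(self, vocab_size: int, hidden: int = 128, num_classes: int = 2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden)
        self.selector = nn.GRU(hidden, hidden, batch_first=True, bidirectional=True)
        self.sel_head = nn.Linear(2 * hidden, 1)      # per-token selection logit
        self.predictor = nn.GRU(hidden, hidden, batch_first=True)
        self.cls_head = nn.Linear(hidden, num_classes)

    def forward(self, token_ids: torch.Tensor):
        emb = self.embed(token_ids)                            # (B, T, H)
        enc, _ = self.selector(emb)                            # (B, T, 2H)
        probs = torch.sigmoid(self.sel_head(enc)).squeeze(-1)  # (B, T) selection probs
        hard = (probs > 0.5).float()
        # Straight-through estimator: hard mask forward, soft gradient backward.
        mask = hard + probs - probs.detach()
        rationale = emb * mask.unsqueeze(-1)                   # zero out unselected tokens
        _, h = self.predictor(rationale)                       # h: (1, B, H)
        logits = self.cls_head(h.squeeze(0))                   # (B, num_classes)
        return logits, mask                                    # prediction and explanation
```

The returned mask is the explanation that UAT2E targets: the triggers aim to change which tokens the selector keeps, even when the final prediction stays the same.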
The key findings are:
Existing rationalization models exhibit significant fragility in their explanations even when predictions remain unchanged: the attacks lead them to select more meaningless tokens, or the triggers themselves, as rationales.
The explanation vulnerability of rationalization models arises from inherent defects such as degeneration and spurious correlations, which the attacks exacerbate.
Using powerful encoders such as BERT, or supervised training with human-annotated rationales, does not guarantee explanation robustness; on the contrary, it can make explanations more susceptible to the attacks.
Enhancing prediction robustness through adversarial training does not effectively improve explanation robustness.
The gradient-based search used in UAT2E is effective at identifying triggers that undermine explanation robustness (a simplified sketch of one search step is given after the recommendations below).
Based on these findings, the authors provide recommendations to improve the explanation robustness of rationalization models, including conducting rigorous evaluations, exploring defense mechanisms, and establishing robust evaluation benchmarks and metrics.
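For readers who want a concrete picture of the gradient-based search mentioned above, the sketch below shows one HotFlip-style candidate-scoring step, reusing the RationalizationModel sketch from earlier. The explanation-attack loss (pushing the selector to drop its clean rationale) and all function names are illustrative assumptions, not the paper's exact objective or code.

```python
# One candidate-scoring step of a HotFlip-style trigger search, in the spirit
# of UAT2E's gradient-based attack on explanations (simplified sketch; the
# loss and function names are illustrative assumptions, not the paper's code).
import torch
import torch.nn.functional as F


def trigger_candidates(model, token_ids, trigger_ids, clean_mask, top_k=10):
    """Score vocabulary tokens as replacements for each trigger position."""
    batch = token_ids.size(0)
    n_trig = trigger_ids.size(0)
    # Prepend the same (universal) trigger to every input in the batch.
    attacked = torch.cat([trigger_ids.unsqueeze(0).expand(batch, -1), token_ids], dim=1)

    emb = model.embed(attacked).detach().requires_grad_(True)
    enc, _ = model.selector(emb)
    probs = torch.sigmoid(model.sel_head(enc)).squeeze(-1)     # attacked selection probs

    # Non-target explanation attack (assumed objective): make the selector stop
    # choosing the tokens it rationalized before the attack (clean_mask).
    loss = (probs * F.pad(clean_mask, (n_trig, 0))).sum()
    loss.backward()

    grad = emb.grad[:, :n_trig, :].sum(dim=0)                  # (n_trig, H), summed over the batch
    # First-order HotFlip score: swapping a trigger slot to embedding e_w changes
    # the loss by roughly (e_w - e_cur) . grad, so good candidates maximize -e_w . grad.
    scores = -(model.embed.weight.detach() @ grad.T)           # (vocab, n_trig)
    return scores.topk(top_k, dim=0).indices                   # candidate ids per trigger slot
```

In a full attack loop, each candidate would be substituted into the trigger, re-evaluated on a batch, and kept only if it improves the attack objective, following the standard Universal Adversarial Triggers procedure that the paper adapts.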
Source: Yuankai Zhan..., arxiv.org, 09-19-2024, https://arxiv.org/pdf/2408.10795.pdf