The paper proposes a unified adversarial training (AT) framework to address the unstructured nature of standard simple-gradient saliency maps. The key idea is to apply norm-regularized AT, in which the norm used to constrain the adversarial perturbations is chosen to promote desirable properties in the resulting gradient-based interpretation maps.
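For reference, the standard constrained form of adversarial training can be written as follows. Here $\|\cdot\|$ is the perturbation norm, $\epsilon$ the perturbation budget, $f_\theta$ the network and $\ell$ the loss. This notation is ours, and the paper's norm-regularized variant may be formulated differently (for instance, with a penalty on $\|\delta\|$ instead of a hard constraint).

```latex
\min_{\theta}\; \mathbb{E}_{(x,y)\sim\mathcal{D}}
\Big[\, \max_{\|\delta\| \le \epsilon} \; \ell\big(f_\theta(x+\delta),\, y\big) \Big]
```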
The authors first provide a convex duality analysis that connects the norm regularizing the adversarial perturbations with the norm of the input gradients. This duality relationship lets them design norm-regularized AT methods that amount to regularizing the simple gradient maps with sparsity-inducing norms, such as the group norm and the elastic net.
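A standard duality fact illustrates the connection. Under a first-order approximation of the loss, ℓ(x+δ) ≈ ℓ(x) + ⟨∇ₓℓ, δ⟩, maximizing over a norm ball of radius ε adds exactly ε‖∇ₓℓ‖_*, where ‖·‖_* is the dual norm. The minimal sketch below checks this numerically for two pairs. The first pairs ℓ∞-bounded perturbations with an ℓ1 (sparsity) penalty on the gradient. The second pairs per-group ℓ2 bounds with a group ℓ2,1 penalty. The random vector `g`, the 3-element groups, and all function names are hypothetical. This is not the paper's implementation, and it does not cover the elastic-net case.

```python
import numpy as np

def linf_maximizer(g, eps):
    """Closed-form argmax of <g, d> over the l_inf ball ||d||_inf <= eps."""
    return eps * np.sign(g)

def group_maximizer(g, eps, groups):
    """Closed-form argmax of <g, d> subject to ||d_G||_2 <= eps for every group G
    (an l_{2,inf} ball). Each group's perturbation aligns with its gradient block."""
    d = np.zeros_like(g)
    for idx in groups:
        norm = np.linalg.norm(g[idx])
        if norm > 0:
            d[idx] = eps * g[idx] / norm
    return d

def group_l21(g, groups):
    """Group (l_{2,1}) norm: the sum of per-group l2 norms."""
    return sum(np.linalg.norm(g[idx]) for idx in groups)

rng = np.random.default_rng(0)
g = rng.normal(size=12)   # hypothetical stand-in for a flattened input gradient
eps = 0.1
groups = [np.arange(i, i + 3) for i in range(0, 12, 3)]  # hypothetical 3-pixel groups

# The worst-case linear gain equals eps times the dual norm of the gradient.
linf_gain = g @ linf_maximizer(g, eps)
group_gain = g @ group_maximizer(g, eps, groups)
print(np.isclose(linf_gain, eps * np.abs(g).sum()))        # True: l_inf pairs with l1
print(np.isclose(group_gain, eps * group_l21(g, groups)))  # True: l_{2,inf} pairs with l_{2,1}

# Sanity check: random feasible l_inf perturbations never beat the closed form.
for _ in range(1000):
    d = rng.uniform(-eps, eps, size=g.shape)
    assert g @ d <= linf_gain + 1e-12
```

In words, training against ℓ∞ attacks behaves, to first order, like penalizing the ℓ1 norm of the gradient map, which pushes it toward sparsity. Grouped bounds likewise push whole groups of gradient entries toward zero together.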
The paper presents several numerical experiments on benchmark image datasets to validate the efficacy of the proposed AT-based methodology. The results demonstrate that the norm-regularized AT methods can enhance the sparsity, connectedness, robustness, and stability of the gradient-based interpretation maps, without compromising their fidelity to the original simple gradient maps.
The authors also use the duality framework to propose an interpretation-harmonization scheme that aligns gradient maps with human attention. This scheme performs satisfactorily in their experiments.
Source: arxiv.org