Regularizing the norm of natural input gradients can achieve near state-of-the-art adversarial robustness on ImageNet, with significantly lower computational cost than adversarial training. The effectiveness of this approach critically depends on the smoothness of the activation functions used in the model architecture.
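As a minimal illustration of the idea, the sketch below adds an input-gradient-norm penalty to the cross-entropy loss of a logistic-regression model, where the input gradient has a closed form and no autodiff is needed. The function name, weighting `lam`, and the logistic setting are illustrative assumptions, not the paper's actual training setup.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def grad_norm_regularized_loss(w, x, y, lam=0.1):
    """Cross-entropy loss plus a penalty on the norm of the input gradient.

    Illustrative sketch: for logistic regression the gradient of the
    cross-entropy loss with respect to the input x has the closed form
    (sigmoid(w @ x) - y) * w, so the penalty is cheap to compute here.
    In a deep network it would be obtained via a double-backward pass.
    """
    p = sigmoid(w @ x)
    eps = 1e-12  # numerical safeguard inside the logs
    ce = -(y * np.log(p + eps) + (1 - y) * np.log(1 - p + eps))
    input_grad = (p - y) * w              # d(ce)/dx in closed form
    penalty = np.linalg.norm(input_grad)  # L2 norm of the natural input gradient
    return ce + lam * penalty
```

Setting `lam = 0` recovers the plain cross-entropy loss; increasing it trades clean accuracy for a smoother, more robust decision boundary.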
The core message of this paper is that mixing the output probabilities of a standard neural network classifier with those of a robust neural network classifier significantly alleviates the accuracy-robustness trade-off, achieving high clean accuracy while maintaining strong adversarial robustness.
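A minimal sketch of such mixing is a convex combination of the two classifiers' predicted class probabilities. The function name and the single mixing weight `alpha` are assumptions for illustration; the paper's actual mixing scheme may be more elaborate (e.g. input-adaptive weights).

```python
import numpy as np

def mix_probs(p_standard, p_robust, alpha=0.5):
    """Convex combination of two classifiers' output probabilities.

    alpha = 0 recovers the standard (high clean accuracy) model and
    alpha = 1 the robust model; intermediate values interpolate between
    the two ends of the accuracy-robustness trade-off.
    """
    p_standard = np.asarray(p_standard, dtype=float)
    p_robust = np.asarray(p_robust, dtype=float)
    return (1.0 - alpha) * p_standard + alpha * p_robust
```

Because both inputs are probability vectors and the weights sum to one, the mixed output is again a valid probability vector, so the prediction is simply its argmax.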