Core Concepts
The evaluation of the Sabre adversarial example defense, published at IEEE S&P 2024, contains significant flaws, and modifying just one or two lines of code is enough to completely break the defense.
Abstract
The paper begins by critiquing the Sabre defense, which claims to be 3x more robust to adversarial attacks than the current state-of-the-art. The authors identify several issues with the evaluation in the original Sabre paper, including:
Mathematically impossible claims, such as the defense achieving nontrivial accuracy at a perturbation bound of 0.5, and the model performing better under attack than without any attack.
Deviations from recommended best practices for evaluating adversarial robustness, such as not verifying that iterative attacks perform better than single-step attacks, not evaluating against adaptive attacks, and not comparing to prior work accurately.
Additional flaws that are not observable from the paper alone, such as the absence of any adaptive-attack evaluation and incorrectly implemented baselines like adversarial training.
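To see why nontrivial accuracy at a perturbation bound of 0.5 cannot come from a correct evaluation, consider the following minimal sketch. It assumes pixel values in [0, 1] and an L-infinity bound, neither of which is stated in this summary. Under those assumptions every image can be moved to the same solid gray image, so no classifier can do better than chance against that one perturbation. The function name and the random stand-in images are hypothetical.

```python
import numpy as np

def gray_image_within_budget(x, eps=0.5):
    """Return the constant-0.5 image and whether it lies in the
    L-infinity ball of radius eps around x (assumes x is in [0, 1])."""
    gray = np.full_like(x, 0.5)
    linf_distance = np.max(np.abs(gray - x))
    return gray, bool(linf_distance <= eps)

# Random stand-ins for MNIST-sized inputs; not real data.
rng = np.random.default_rng(0)
images = rng.random((8, 28, 28))

# Extreme cases: an all-black and an all-white image.
extremes = [np.zeros((28, 28)), np.ones((28, 28))]

all_reachable = all(
    gray_image_within_budget(img)[1] for img in list(images) + extremes
)
```

Because every input maps to the identical gray image, the model's prediction on the perturbed inputs is the same for every class, which caps accuracy at chance level.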
The authors then demonstrate two attacks that completely break the Sabre defense. The first attack removes an unnecessary BPDA wrapper, which reduces the robust accuracy to 0% on both MNIST and CIFAR-10. In response, the Sabre authors modified the defense to include a new component that discretizes the input. However, the paper's authors show that this modified defense contains a second bug, and a simple one-line change further reduces the robust accuracy to below baseline levels.
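As a general illustration of why a discretization step matters to a gradient-based attack, the sketch below rounds inputs to one decimal place and measures the gradient numerically. It is not a reproduction of Sabre's code or of the specific bug described above; the linear "model", the weights, and the inputs are hypothetical. Rounding is flat almost everywhere, so the measured gradient is zero and an attack that relies on it stalls, while treating the rounding as the identity on the backward pass (a straight-through approximation in the spirit of BPDA) recovers a usable direction.

```python
import numpy as np

def discretize(x, decimals=1):
    """Round each input value to a fixed number of decimals."""
    return np.round(x, decimals=decimals)

def loss(x, w):
    """Toy linear loss on the discretized input (hypothetical stand-in)."""
    return float(np.dot(w, discretize(x)))

def finite_diff_grad(f, x, h=1e-4):
    """Central finite-difference estimate of the gradient of f at x."""
    grad = np.zeros_like(x)
    for i in range(x.size):
        step = np.zeros_like(x)
        step[i] = h
        grad[i] = (f(x + step) - f(x - step)) / (2 * h)
    return grad

w = np.array([1.0, -2.0, 0.5])
x = np.array([0.23, 0.58, 0.71])  # chosen away from rounding boundaries

# Gradient through the rounding step: zero, because rounding is locally flat.
masked_grad = finite_diff_grad(lambda z: loss(z, w), x)

# Straight-through approximation: pretend discretize(x) == x when differentiating.
straight_through_grad = w.copy()

# One signed step of size 0.1 along the approximate gradient raises the loss.
x_adv = np.clip(x + 0.1 * np.sign(straight_through_grad), 0.0, 1.0)
loss_before, loss_after = loss(x, w), loss(x_adv, w)
```

The zero gradient here says nothing about robustness: the loss still changes once the perturbation is large enough to cross a rounding boundary, which is the usual signature of gradient masking.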
The paper concludes by discussing the broader implications of these findings, emphasizing the importance of thorough and rigorous evaluations of adversarial defenses, especially as they are being deployed in real-world production systems.
Stats
On MNIST, the attack success rate increases from 13% to 21% when the discretization component is added to Sabre.
On CIFAR-10, the attack success rate is 100% when the number of decimals is set to 1 in the discretization component.
Quotes
"Sabre makes a number of claims that are impossible for any correct evaluation."
"Sabre also deviates from recommended practices in evaluating adversarial robustness."
"Sabre is not evaluated against adaptive attacks."