Key Concepts
Emotional prosody can be effectively used as a backdoor trigger to compromise the integrity of speaker identification deep neural networks.
Summary
The paper introduces a novel backdoor attack called "EmoBack" that leverages emotional prosody as a trigger to compromise the integrity of speaker identification (SI) deep neural networks (DNNs). The authors conducted a comprehensive parameter study using three different datasets (ESD-en, ESD-zh, and RAVDESS) and three DNN architectures (ResNet, X-vectors, and ECAPA-TDNN) to evaluate the impact of emotions as backdoor triggers on the accuracy of SI systems.
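The summary does not spell out how the training data is poisoned. As a rough illustration only, the sketch below assumes a dirty-label scheme in which a fraction of utterances spoken with the trigger emotion are relabeled to an attacker-chosen speaker. The `Utterance` record, the `poison_dataset` helper, and the choice to measure the poisoning rate against the whole training set are all hypothetical and are not taken from the paper.

```python
import random
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Utterance:
    # Hypothetical record type for one training sample.
    path: str
    speaker: int   # speaker ID used as the classification label
    emotion: str   # e.g. "sad", "neutral", "happy"


def poison_dataset(samples, trigger_emotion, target_speaker, poisoning_rate, seed=0):
    """Relabel a fraction of trigger-emotion utterances to the target speaker.

    Assumption: poisoning_rate is a fraction of the full training set,
    capped by the number of utterances that carry the trigger emotion.
    """
    rng = random.Random(seed)
    # Only utterances with the trigger emotion that do not already
    # belong to the target speaker are candidates for relabeling.
    candidates = [
        i for i, s in enumerate(samples)
        if s.emotion == trigger_emotion and s.speaker != target_speaker
    ]
    n_poison = min(len(candidates), round(poisoning_rate * len(samples)))
    chosen = set(rng.sample(candidates, n_poison))
    # The audio itself is left untouched; only the label changes.
    return [
        replace(s, speaker=target_speaker) if i in chosen else s
        for i, s in enumerate(samples)
    ]
```

In this sketch the trigger is the emotion already present in the recording, so no signal is mixed into the audio. This matches the summary's description of the trigger as inconspicuous.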
The key findings are:
- The proposed EmoBack attack is highly effective, achieving Attack Success Rates (ASRs) up to 98.9% while maintaining high Clean Accuracy (CA) of at least 86.4% across all models and datasets.
- Certain emotions, such as Sad and Neutral, are more effective backdoor triggers than others, such as Happy and Surprise.
- The attack performance is influenced by the model architecture, dataset, and poisoning rate, with the ESD-zh dataset and ECAPA-TDNN/ResNet models being more vulnerable.
- The attack's effectiveness shows no consistent bias with respect to speaker gender.
- The authors explored defense strategies, including pruning, STRIP-ViTA, and preprocessing techniques (quantization, median filtering, and squeezing), to mitigate the attack. Among these, pruning shows the potential to reduce the ASR up to 40% without significantly affecting the CA.
- The results highlight the vulnerability of SI systems to emotional prosody-based backdoor attacks and provide insights into potential solutions for reinforcing these systems against such attacks.
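The two metrics reported in the findings above can be sketched as follows. The summary does not define them, so this assumes the usual backdoor-literature definitions: CA is the fraction of clean test utterances assigned to the correct speaker, and ASR is the fraction of triggered utterances from non-target speakers that the model assigns to the target speaker. The function names are illustrative.

```python
def clean_accuracy(predictions, true_labels):
    """Fraction of clean test utterances classified as their true speaker."""
    if not true_labels:
        raise ValueError("no clean samples to evaluate")
    correct = sum(p == t for p, t in zip(predictions, true_labels))
    return correct / len(true_labels)


def attack_success_rate(predictions, true_labels, target_speaker):
    """Fraction of triggered utterances misclassified as the target speaker.

    Utterances that genuinely belong to the target speaker are excluded,
    since predicting the target for them is not an attack success.
    """
    outcomes = [
        p == target_speaker
        for p, t in zip(predictions, true_labels)
        if t != target_speaker
    ]
    if not outcomes:
        raise ValueError("no triggered samples from non-target speakers")
    return sum(outcomes) / len(outcomes)
```

A successful backdoor keeps `clean_accuracy` close to that of an unpoisoned model while pushing `attack_success_rate` high on utterances that carry the trigger emotion.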
Statistics
"The ASR for male speakers ranged from 52.4% (Happy) to 70.7% (Sad) on the ESD-en dataset."
"The ASR for female speakers ranged from 60.2% (Happy) to 76.3% (Surprise) on the ESD-en dataset."
"The ASR ranged from 65.4% (Happy) to 89.1% (Neutral) on the ESD-zh dataset for male speakers."
"The ASR varied from 55.7% (Surprise) to 84.2% (Neutral) on the ESD-zh dataset for female speakers."
"The ASR for male speakers ranged from 77.6% (Happy) to 93.8% (Sad) on the ESD-en dataset using the ResNet model."
"The ASR for female speakers ranged from 80.9% (Happy) to 94.7% (Sad) on the ESD-en dataset using the ResNet model."
Quotes
"The use of an emotional trigger is inconspicuous and more likely to be persistent and reusable, making it an effective method for avoiding detection."
"Our findings show that the aforementioned models are prone to our attack, indicating that emotional triggers (sad and neutral prosody) can be effectively used to compromise the integrity of SI systems."
"Among the defenses tested, pruning shows the potential to mitigate the attack's impact when pruning multiple convolutional layers, decreasing the ASR up to 40% without affecting the CA."