Key Idea
Transformer-based models are vulnerable to adversarial attacks, but can be made more robust through techniques like data augmentation and embedding perturbation loss.
Summary
The authors investigate adversarial attacks and defenses for conversation entailment models. They first fine-tune a pre-trained RoBERTa model on a conversation entailment dataset, achieving a strong baseline performance.
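The baseline setup can be illustrated with a short sketch using the Hugging Face Transformers API. This is an assumed reconstruction, not the authors' released code; the dataset field names (segment, hypothesis, label) and the hyperparameters are placeholders.

```python
# Minimal sketch: fine-tuning roberta-base as a binary conversation-entailment
# classifier. Dataset field names and hyperparameters are assumptions.
import torch
from torch.utils.data import DataLoader
from transformers import RobertaTokenizer, RobertaForSequenceClassification

tokenizer = RobertaTokenizer.from_pretrained("roberta-base")
model = RobertaForSequenceClassification.from_pretrained("roberta-base", num_labels=2)
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)

def collate(batch):
    # Pair each conversation segment with its hypothesis; label 1 = entailed.
    enc = tokenizer([ex["segment"] for ex in batch],
                    [ex["hypothesis"] for ex in batch],
                    truncation=True, padding=True, return_tensors="pt")
    enc["labels"] = torch.tensor([ex["label"] for ex in batch])
    return enc

def finetune(examples, epochs=3, batch_size=16):
    loader = DataLoader(examples, batch_size=batch_size, shuffle=True, collate_fn=collate)
    model.train()
    for _ in range(epochs):
        for batch in loader:
            loss = model(**batch).loss   # cross-entropy computed by the model head
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
```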
For the attack stage, the authors experiment with synonym-swapping as an adversarial attack method. They perform a grid search over different attack parameters like percentage of words to swap, minimum cosine similarity, and maximum number of candidate replacements. The results show that while aggressive attacks can significantly degrade model performance, slight modifications can actually improve the model's accuracy on the test set.
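A hand-rolled version of such an attack might look like the sketch below. The word-embedding table (`embeddings`, a word-to-vector dict) and the candidate-selection details are assumptions; the paper's exact attack implementation may differ.

```python
# Minimal sketch of a synonym-swap attack controlled by the three parameters
# the authors sweep: fraction of words swapped, minimum cosine similarity,
# and maximum number of candidate replacements. `embeddings` is a hypothetical
# word -> numpy vector lookup (e.g. counter-fitted word vectors).
import random
import numpy as np

def cosine(u, v):
    return float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12))

def synonym_swap(sentence, embeddings, swap_pct=0.2, min_cos_sim=0.8, max_candidates=10):
    words = sentence.split()
    n_swaps = max(1, int(len(words) * swap_pct))
    for i in random.sample(range(len(words)), min(n_swaps, len(words))):
        w = words[i].lower()
        if w not in embeddings:
            continue
        # Rank other vocabulary words by similarity to the target word, keep the
        # top-k candidates above the threshold, and pick one at random.
        scored = sorted(((cosine(embeddings[w], vec), cand)
                         for cand, vec in embeddings.items() if cand != w), reverse=True)
        candidates = [c for s, c in scored[:max_candidates] if s >= min_cos_sim]
        if candidates:
            words[i] = random.choice(candidates)
    return " ".join(words)
```

The grid search then simply loops over values of swap_pct, min_cos_sim, and max_candidates, attacks the test set with each setting, and re-evaluates the fine-tuned model.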
To defend against adversarial attacks, the authors explore several strategies:
- Data augmentation: Fine-tuning the model on the adversarially attacked training data. This improves performance on the attacked test set but hurts performance on the original test set (a minimal sketch follows this list).
- Embedding perturbation loss: Introducing Gaussian noise to the hidden embeddings during training, in addition to the standard cross-entropy loss. This helps improve robustness without sacrificing performance on the original domain (see the second sketch after this list).
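A minimal sketch of the data-augmentation defense, reusing the synonym_swap and finetune helpers from the sketches above. Whether the attacked examples replace the original training data or are mixed with it is an assumption here.

```python
# Minimal sketch: attack the training set and fine-tune on the combined data.
# Keeping the original examples alongside the attacked copies is an assumption.
def augment_and_finetune(train_examples, embeddings):
    attacked = [{**ex,
                 "segment": synonym_swap(ex["segment"], embeddings),
                 "hypothesis": synonym_swap(ex["hypothesis"], embeddings)}
                for ex in train_examples]
    finetune(train_examples + attacked)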
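The embedding perturbation loss can be sketched as follows: the entailment cross-entropy is computed on both the original embeddings and a Gaussian-noised copy, and the two losses are summed. Adding the noise at the input (token) embedding layer, the noise scale, and the equal weighting of the two terms are assumptions here, not the authors' exact formulation.

```python
# Minimal sketch of the embedding perturbation loss: sum of the cross-entropy
# on the clean embeddings and on a Gaussian-perturbed copy. noise_std and the
# 1:1 weighting are assumed values.
import torch

def perturbation_loss(model, batch, noise_std=0.01):
    embeds = model.get_input_embeddings()(batch["input_ids"])   # token embeddings
    clean = model(inputs_embeds=embeds,
                  attention_mask=batch["attention_mask"], labels=batch["labels"])
    noisy = model(inputs_embeds=embeds + noise_std * torch.randn_like(embeds),
                  attention_mask=batch["attention_mask"], labels=batch["labels"])
    return clean.loss + noisy.loss   # loss on original + perturbed embeddings
```

In the training loop from the first sketch, this combined loss would take the place of the plain cross-entropy loss.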
The authors discuss the practical implications of adversarial attacks on real-world NLP systems and the importance of developing robust defenses. They also propose future research directions, such as exploring more sophisticated perturbation methods and further improving the embedding perturbation loss approach.
Statistics
The conversation segments and hypotheses in the conversation entailment dataset are relatively short.
Slight adversarial modifications can flip a few false-positive predictions to negative, actually increasing the baseline model's overall test accuracy.
Aggressive adversarial attacks can lower the baseline model's test accuracy from 70% to 56%.
Quotes
"Transformer-based models are relatively robust against synonym-swapping. This means that the pre-trained language models have gained a good understanding of synonyms, and this understanding is embedded into their word embedding in vector space."
"Fine-tuning will cause the model to forget the information from the origin domain. To build a more robust model, we propose embedding perturbation loss, which includes the entailment prediction loss with the origin embeddings and perturbated embeddings."