Core Concepts
Adversarial datasets, collected by exploiting model weaknesses, can improve the robustness of hate speech detection models.
Abstract
The paper introduces GAHD, a new German Adversarial Hate speech Dataset, collected through four rounds of dynamic adversarial data collection (DADC).
In the first round (R1), annotators freely created adversarial examples to trick the target model. In the subsequent rounds, the authors explored new strategies to support the annotators (a sketch of the collection loop follows the list):
- R2: Annotators validated and expanded on English-to-German translated adversarial examples.
- R3: Annotators validated newspaper sentences that the target model had incorrectly classified as hate speech.
- R4: Annotators created contrastive examples by modifying challenging examples from previous rounds.
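To make the core DADC workflow concrete, here is a minimal sketch of one collection round: candidates are kept only when the human label disagrees with the target model's prediction, i.e., the model is fooled. All function names here (`dadc_round`, `annotate`) are hypothetical placeholders, not code from the paper.

```python
# Minimal sketch of one DADC round; names are illustrative placeholders.

def dadc_round(target_model, candidate_sentences, annotate):
    """Keep candidates whose human label disagrees with the model's
    prediction: those are the adversarial examples."""
    adversarial = []
    for text in candidate_sentences:
        model_label = target_model(text)  # e.g. "hate speech" / "not hate speech"
        human_label = annotate(text)      # expert annotator's gold label
        if human_label != model_label:
            # The model is fooled: add the example to the dataset.
            adversarial.append({"text": text, "label": human_label})
    return adversarial
```

The rounds differ mainly in where `candidate_sentences` comes from: freely written examples (R1), translated English adversarial examples (R2), misclassified newspaper sentences (R3), or contrastive edits of earlier examples (R4).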
The resulting GAHD dataset contains 10,996 examples, with 42.4% labeled as hate speech. Experiments show that training on GAHD substantially improves the robustness of the target model, with 18-20 percentage point increases in macro F1 on in-domain and out-of-domain test sets. The authors further find that mixing multiple support strategies for annotators leads to the most consistent improvements.
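For reference, macro F1 is the unweighted mean of the per-class F1 scores, so gains on the rarer hate speech class count as much as gains on the majority class. A minimal illustration with scikit-learn (the labels below are made up for the example):

```python
from sklearn.metrics import f1_score

y_true = [1, 0, 1, 1, 0, 0]  # 1 = hate speech, 0 = not hate speech
y_pred = [1, 0, 0, 1, 0, 1]

# Macro F1: average the F1 score of each class with equal weight.
print(f1_score(y_true, y_pred, average="macro"))
```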
Benchmarking on GAHD shows that it is a challenging dataset: among the tested large language models and commercial APIs, only GPT-4 achieves over 80% macro F1.
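A benchmark run over GAHD could look like the following sketch. The model checkpoint name, example texts, and label strings are assumptions for illustration, not the paper's exact setup; in practice the pipeline's output labels must be mapped to the dataset's label scheme.

```python
from sklearn.metrics import f1_score
from transformers import pipeline

# Hypothetical checkpoint name; the paper benchmarks several LLMs and
# commercial APIs, not this specific model.
clf = pipeline("text-classification", model="some-org/german-hate-speech-model")

# In practice these would be the GAHD test examples and gold labels.
texts = ["Beispielsatz eins.", "Beispielsatz zwei."]
gold = ["hate speech", "not hate speech"]

preds = [out["label"] for out in clf(texts)]
print(f1_score(gold, preds, average="macro"))
```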
Stats
"GAHD contains 10,996 adversarial examples, with 42.4% labeled as hate speech." (Section 3.6)
"Training on GAHD leads to 18-20 percentage point increases in macro F1 on in-domain and out-of-domain test sets." (Section 4.1)
Quotes
"Hate speech detection models are only as good as the data they are trained on." (Introduction)
"Adversarial datasets, collected by exploiting model weaknesses, promise to fix this problem." (Introduction)
"Mixing multiple support strategies for annotators leads to the most consistent improvements."