The paper introduces GAHD, a new German Adversarial Hate speech Dataset, collected through four rounds of dynamic adversarial data collection (DADC).
In the first round (R1), annotators freely created adversarial examples to trick the target model. In the subsequent rounds (R2 to R4), the authors explored new strategies for supporting annotators in finding examples that fool the model.
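The core DADC loop described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: `dadc_round` and `model_predict` are hypothetical names, and the loop simply keeps the candidate examples that the target model misclassifies.

```python
# Hypothetical sketch of one DADC round: annotators submit labeled candidate
# examples, and those that the target model misclassifies count as
# "adversarial" and are added to the growing dataset.
def dadc_round(candidates, model_predict):
    """candidates: list of (text, gold_label); model_predict: text -> label."""
    adversarial = []
    for text, gold_label in candidates:
        if model_predict(text) != gold_label:  # the model was fooled
            adversarial.append((text, gold_label))
    return adversarial
```

After each round, the collected adversarial examples can be used to retrain the target model before the next round, which is what makes the collection "dynamic".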
The resulting GAHD dataset contains 10,996 examples, with 42.4% labeled as hate speech. Experiments show that training on GAHD substantially improves the robustness of the target model, with 18-20 percentage point increases in macro F1 on in-domain and out-of-domain test sets. The authors further find that mixing multiple support strategies for annotators leads to the most consistent improvements.
Benchmarking on GAHD reveals that it is a challenging dataset, with only GPT-4 among the tested large language models and commercial APIs achieving over 80% macro F1.
Source: https://arxiv.org/pdf/2403.19559.pdf (by Jani..., arxiv.org, 03-29-2024)