The paper introduces GAHD, a new German Adversarial Hate speech Dataset, collected through four rounds of dynamic adversarial data collection (DADC).
In the first round (R1), annotators freely created adversarial examples to trick the target model. In the subsequent rounds (R2-R4), the authors explored new strategies for supporting the annotators.
The resulting GAHD dataset contains 10,996 examples, with 42.4% labeled as hate speech. Experiments show that training on GAHD substantially improves the robustness of the target model, with 18-20 percentage point increases in macro F1 on in-domain and out-of-domain test sets. The authors further find that mixing multiple support strategies for annotators leads to the most consistent improvements.
Benchmarking on GAHD reveals that it is a challenging dataset, with only GPT-4 among the tested large language models and commercial APIs achieving over 80% macro F1.
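Macro F1, the metric used for all reported results, averages the per-class F1 scores so that the hate-speech class (42.4% of GAHD) counts equally with the majority class regardless of imbalance. A minimal sketch of the computation, using hypothetical binary labels (1 = hate speech, 0 = not):

```python
def macro_f1(y_true, y_pred):
    """Unweighted average of per-class F1 scores; each class counts equally."""
    classes = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1_scores.append(2 * precision * recall / (precision + recall)
                         if precision + recall else 0.0)
    return sum(f1_scores) / len(f1_scores)

# Hypothetical predictions, not taken from the paper:
y_true = [1, 1, 0, 0, 0, 1]
y_pred = [1, 0, 0, 0, 1, 1]
print(round(macro_f1(y_true, y_pred), 3))  # → 0.667
```

Because macro F1 weights both classes equally, a model that simply predicts "not hate speech" for everything scores poorly, which is why it is the standard choice for imbalanced hate-speech benchmarks.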
Key insights distilled from https://arxiv.org/pdf/2403.19559.pdf by Jani... at arxiv.org, 03-29-2024.