HateModerate: Testing Hate Speech Detectors against Content Moderation Policies
Core Concepts
Automated hate speech detectors' conformity to content policies is crucial for transparent and accountable content moderation.
Summary
Existing works focus on automated hate speech detection to protect users from harmful content.
The HateModerate dataset was created to test automated content moderators against Facebook's 41 hate speech policies.
State-of-the-art hate speech detectors show substantial failures in conforming to content policies.
Fine-tuning models with HateModerate data improves conformity while maintaining performance.
Dataset statistics, annotation process, and evaluation metrics detailed.
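To make the evaluation idea concrete, here is a minimal, hypothetical sketch of how per-policy failure rates could be computed for a detector tested against policy-labeled examples, in the spirit of HateModerate. The function names, data layout, and toy predictor are illustrative assumptions, not the paper's actual implementation.

```python
# Hypothetical sketch: per-policy failure rates for a hate speech
# detector evaluated on policy-labeled examples. Data layout and the
# toy predictor below are assumptions for illustration only.

from collections import defaultdict


def per_policy_failure_rates(examples, predict):
    """Compute the fraction of misclassified examples per policy.

    examples: iterable of (text, policy_id, is_hateful) tuples,
              where is_hateful is the ground-truth label.
    predict:  callable mapping text -> bool (True = flagged as hateful).
    """
    totals = defaultdict(int)
    failures = defaultdict(int)
    for text, policy_id, is_hateful in examples:
        totals[policy_id] += 1
        # A failure is any disagreement with the ground-truth label,
        # covering both missed hate speech and false positives.
        if predict(text) != is_hateful:
            failures[policy_id] += 1
    return {p: failures[p] / totals[p] for p in totals}


# Toy usage with a naive keyword predictor (purely illustrative).
toy_examples = [
    ("text containing hate", "policy_1", True),
    ("a pleasant greeting", "policy_1", False),
    ("more hate content", "policy_2", True),
    ("benign but mislabeled-prone text", "policy_2", True),
]
naive_predict = lambda text: "hate" in text
rates = per_policy_failure_rates(toy_examples, naive_predict)
```

Aggregating failures by policy rather than overall accuracy is what lets this kind of evaluation surface the finding above: a detector can score well in aggregate while failing badly on specific policies, especially on non-hateful examples.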
HateModerate
Statistics
"A unified taxonomy of harmful content." - Banko et al., 2020
"Facebook specifies 41 community standards guidelines for moderating hate speech." - Facebook, 2022
"Google’s automatic content moderator detected 95% of unwanted content before it was seen by a user." - Google, 2023b
Quotes
"Because content moderation rules are often uniquely defined, existing hate speech datasets cannot directly answer this question."
"Models generally have high failure rates for non-hateful examples."
"Our dataset highlights the importance of investigating hate speech detectors’ conformity to content policies."