Core Concepts
Using large language models (LLMs) to extract rationales enables interpretable hate speech classifiers.
Abstract
Social media platforms can harbor hate speech due to anonymity.
The proposed SHIELD framework combines LLM-extracted rationales with hate speech classifiers to make predictions interpretable.
Evaluation on multiple social media hate speech datasets shows high-quality LLM-extracted rationales and retention of detection performance even after training for interpretability.
The methodology proceeds in three stages: LLM-based feature (rationale) extraction, fusion of rationale and post embeddings, and classification; see the sketch below.
Comparisons against baselines such as HateBERT and HateXplain demonstrate the framework's effectiveness.
Results show that SHIELD maintains detection performance while adding interpretability.
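To make the three-stage methodology concrete, here is a minimal sketch of how rationale extraction, embedding fusion, and classification might fit together. It is an illustration under stated assumptions, not the authors' implementation: the encoder choice (HateBERT via GroNLP/hateBERT), mean pooling, concatenation as the fusion step, the two-layer classifier head, and the rationale prompt are all assumptions.

```python
# A minimal sketch of the three-stage pipeline, assuming HateBERT
# (GroNLP/hateBERT) as the encoder and concatenation as the fusion
# step; pooling, prompt, and classifier head are illustrative
# assumptions, not the paper's exact design.
import torch
import torch.nn as nn
from transformers import AutoTokenizer, AutoModel

ENCODER = "GroNLP/hateBERT"  # assumed base encoder
tokenizer = AutoTokenizer.from_pretrained(ENCODER)
encoder = AutoModel.from_pretrained(ENCODER)

def embed(texts):
    # Mean-pool token embeddings over the attention mask.
    batch = tokenizer(texts, padding=True, truncation=True,
                      return_tensors="pt")
    with torch.no_grad():
        hidden = encoder(**batch).last_hidden_state   # (B, T, H)
    mask = batch["attention_mask"].unsqueeze(-1)      # (B, T, 1)
    return (hidden * mask).sum(1) / mask.sum(1)       # (B, H)

class FusionClassifier(nn.Module):
    # Stage 3: classify the fused embeddings.
    def __init__(self, hidden=768, num_classes=2):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(2 * hidden, hidden),
            nn.ReLU(),
            nn.Linear(hidden, num_classes),
        )

    def forward(self, post_emb, rationale_emb):
        # Stage 2: embedding fusion via concatenation (assumption).
        fused = torch.cat([post_emb, rationale_emb], dim=-1)
        return self.head(fused)

# Stage 1 would prompt an LLM for the rationale, e.g. (hypothetical
# prompt): "Quote the span of this post that makes it hateful: <post>"
posts = ["<social media post>"]
rationales = ["<LLM-extracted rationale span>"]
logits = FusionClassifier()(embed(posts), embed(rationales))
```

Concatenation is the simplest plausible fusion choice here; an attention-based fusion would be a drop-in alternative, and the classifier head would be trained on posts paired with their LLM-extracted rationales.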
Stats
"Our comprehensive evaluation on a variety of social media hate speech datasets demonstrate: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability."
"Dataset statistics for explicit hate speech dataset from different social media platforms: GAB - 83.7% hate posts, YouTube - 33.6% hate posts, Reddit - 28.4% hate posts, Twitter - 37.6% hate posts."
Quotes
"While incorporating qualities of interpretability directly into deep neural network models such as pre-trained language model based detectors is challenging..."
"Our SHIELD framework performs at par with a simple HateBERT fine-tuned on the same dataset..."