Core Concepts
Using large language models (LLMs) to extract rationales (the text spans that justify a label) for hate speech detection enables interpretable classifiers.
Stats
"Our comprehensive evaluation on a variety of social media hate speech datasets demonstrate [sic]: (1) the goodness of the LLM-extracted rationales, and (2) the surprising retention of detector performance even after training to ensure interpretability."
Dataset statistics for explicit hate speech datasets from different social media platforms (fraction of hate posts):
GAB: 83.7%
YouTube: 33.6%
Reddit: 28.4%
Twitter: 37.6%
Quotes
"While incorporating qualities of interpretability directly into deep neural network models such as pre-trained language model based detectors is challenging..."
"Our SHIELD framework performs at par with a simple HateBERT fine-tuned on the same dataset..."