The paper introduces a new dataset called ViTHSD for targeted hate speech detection on Vietnamese social media texts. The dataset contains 10,000 comments, each labeled with specific targets (individuals, groups, religion/creed, race/ethnicity, politics) and three levels of hate speech (clean, offensive, hate).
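To make the labeling scheme concrete, the following is a minimal sketch of how one annotated comment could be represented. The class name, the field names, and the "target#level" string encoding are hypothetical choices for illustration, not the dataset's actual file format.

```python
from dataclasses import dataclass, field

# Target and level names as listed in the summary above.
TARGETS = ("individuals", "groups", "religion/creed", "race/ethnicity", "politics")
LEVELS = ("clean", "offensive", "hate")


@dataclass
class AnnotatedComment:
    """One comment with a level for each target it mentions (hypothetical schema)."""
    text: str
    # Maps a target name to its level; targets that are not mentioned are absent.
    labels: dict = field(default_factory=dict)

    def __post_init__(self):
        # Reject anything outside the five targets and three levels.
        for target, level in self.labels.items():
            if target not in TARGETS:
                raise ValueError(f"unknown target: {target}")
            if level not in LEVELS:
                raise ValueError(f"unknown level: {level}")

    def targets(self):
        """Labels for the target detection task."""
        return sorted(self.labels)

    def target_level_pairs(self):
        """Labels for the target with level detection task, encoded as 'target#level'."""
        return sorted(f"{t}#{l}" for t, l in self.labels.items())


comment = AnnotatedComment("example comment", {"groups": "hate", "individuals": "offensive"})
print(comment.targets())
print(comment.target_level_pairs())
```

Under this representation a single comment can carry several targets at once, so both tasks are multi-label problems.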
The authors propose a baseline model that combines the Bi-GRU-LSTM-CNN architecture with pre-trained BERTology language models (BERT, XLM-R, PhoBERT, VELECTRA, ViSoBERT), which supply the text representations. The models are evaluated on two tasks, target detection and target with level detection, using precision, recall, and F1-score.
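As a rough illustration of the evaluation metrics, the sketch below computes per-label precision, recall, and F1-score for a multi-label task, plus their macro average. The function name and the choice of macro averaging are assumptions; the summary does not say which averaging scheme the paper reports.

```python
def precision_recall_f1(gold, pred):
    """Per-label and macro-averaged precision, recall, and F1.

    gold and pred are lists of label sets, one set per comment.
    Macro averaging is an assumption made for this sketch.
    """
    labels = sorted(set().union(*gold, *pred))
    per_label = {}
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if label in g and label in p)
        fp = sum(1 for g, p in pairs if label not in g and label in p)
        fn = sum(1 for g, p in pairs if label in g and label not in p)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        per_label[label] = (precision, recall, f1)
    if not labels:
        return per_label, (0.0, 0.0, 0.0)
    # Macro average: unweighted mean of each metric over the labels.
    macro = tuple(
        sum(scores[i] for scores in per_label.values()) / len(labels)
        for i in range(3)
    )
    return per_label, macro


gold = [{"groups"}, {"individuals"}, {"groups", "politics"}]
pred = [{"groups"}, {"groups"}, {"politics"}]
per_label, macro = precision_recall_f1(gold, pred)
print(per_label)
print(macro)
```

The same function would apply to the target with level detection task if each label is a combined string such as "groups#hate" rather than a bare target name.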
The results show that XLM-R performs best on the target detection task, while ViSoBERT achieves the highest scores on the target with level detection task. The authors also propose a methodology for integrating the baseline model into an online streaming system for real-time detection of hateful comments on social media platforms.
The error analysis reveals that the model struggles with social media language, such as slang, abbreviations, and code-mixing, which can lead to misclassifications. The authors suggest incorporating lexicon normalization as a pre-processing step to address this challenge.
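A dictionary-based version of such a pre-processing step might look like the sketch below. The lexicon entries are a tiny hypothetical sample of common Vietnamese abbreviations, and the function name is illustrative; this is not the normalization procedure used by the authors.

```python
import re
import unicodedata

# Hypothetical sample lexicon: informal spelling -> standard form.
LEXICON = {
    "ko": "không",
    "dc": "được",
    "mik": "mình",
    "bt": "bình thường",
}

WORD = re.compile(r"\w+")


def normalize(text, lexicon=LEXICON):
    """Replace whole words found in the lexicon; leave everything else untouched."""
    # Compose characters first so accented letters match as single word characters.
    text = unicodedata.normalize("NFC", text)
    return WORD.sub(lambda m: lexicon.get(m.group(0).lower(), m.group(0)), text)


print(normalize("Ko dc đâu!"))
```

Matching whole words, rather than substrings, keeps an entry such as "ko" from rewriting part of a longer word. A real lexicon would need to be much larger and would still miss ambiguous abbreviations and code-mixed text.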