The paper presents the Sinhala Offensive Language Dataset (SOLD), which is the largest annotated dataset for detecting offensive content in the Sinhala language. The dataset contains 10,000 tweets annotated at both the sentence and token level, allowing for the development of explainable models.
The authors first describe the data collection and annotation process. They collected 10,500 tweets using predefined keywords and filtered for Sinhala script. The tweets were then annotated by a team of 10 native Sinhala speakers, who labeled each tweet as offensive or not offensive at the sentence level. If a tweet was labeled as offensive, the annotators also highlighted the specific tokens that contributed to the offensiveness.
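The two-level annotation scheme described above can be pictured as a small record type. This is a minimal sketch, not the dataset's actual file format: the class name `AnnotatedTweet`, its field names, and the placeholder tokens are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class AnnotatedTweet:
    """Hypothetical record for one tweet; the field names are illustrative only."""
    tokens: List[str]                                   # the tweet split into tokens
    offensive: bool                                     # sentence-level label
    rationale: List[int] = field(default_factory=list)  # indices of highlighted tokens

    def __post_init__(self):
        # Token highlights are only collected for tweets labeled offensive.
        if not self.offensive and self.rationale:
            raise ValueError("only offensive tweets carry token highlights")
        if any(i < 0 or i >= len(self.tokens) for i in self.rationale):
            raise ValueError("rationale index out of range")

    def token_labels(self) -> List[int]:
        """Return one 0/1 label per token (1 = highlighted as offensive)."""
        marked = set(self.rationale)
        return [1 if i in marked else 0 for i in range(len(self.tokens))]


# Placeholder tokens stand in for real tweet text.
example = AnnotatedTweet(tokens=["w1", "w2", "w3"], offensive=True, rationale=[1])
print(example.token_labels())
```

Storing the highlights as indices keeps the sentence-level label and the token-level labels in one record, so the same example can feed either task.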
The authors then conduct experiments using various machine learning models, including support vector machines, BiLSTMs, CNNs, and transformer-based models, to perform sentence-level and token-level offensive language identification on the SOLD dataset. The results show that transformer models, particularly XLM-R and XLM-T, outperform the other approaches, achieving macro F1 scores of 0.83 and 0.82 respectively for sentence-level classification.
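Macro F1, the metric reported above, is the unweighted mean of the per-class F1 scores, so the offensive and not-offensive classes count equally even when one is much rarer. The sketch below computes it from scratch on made-up labels; the function names and the toy data are assumptions, not the authors' evaluation code.

```python
def f1_for_class(y_true, y_pred, cls):
    """F1 score for a single class, treating `cls` as the positive label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p == cls)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t != cls and p == cls)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == cls and p != cls)
    if tp == 0:
        return 0.0  # no true positives: F1 is taken as 0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over all labels that occur."""
    classes = sorted(set(y_true) | set(y_pred))
    return sum(f1_for_class(y_true, y_pred, c) for c in classes) / len(classes)


# Toy labels (1 = offensive, 0 = not offensive); not taken from SOLD.
y_true = [1, 1, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 1]
print(macro_f1(y_true, y_pred))
```

In practice, scikit-learn's `f1_score` with `average="macro"` computes the same quantity.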
For token-level classification, the authors also explore weak supervision with LIME, an explanation technique that attributes a sentence-level prediction to individual tokens and can therefore flag offensive tokens without token-level annotations. Among the models trained on token-level annotations, the transformer-based models outperform the BiLSTM approach. The weakly supervised LIME-based methods do not reach the performance of these supervised models.
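The idea behind this kind of weak supervision is to derive token labels from a sentence-level classifier alone. The sketch below does not implement LIME itself, which fits a local surrogate model over many perturbed inputs. It uses a simpler leave-one-out occlusion score as a stand-in, and `toy_classifier`, `TOY_LEXICON`, and the `threshold` value are hypothetical placeholders rather than anything from the paper.

```python
from typing import Callable, List


def occlusion_scores(tokens: List[str],
                     predict_offensive: Callable[[List[str]], float]) -> List[float]:
    """Score each token by how much removing it lowers the offensive probability."""
    base = predict_offensive(tokens)
    scores = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]   # the sentence without token i
        scores.append(base - predict_offensive(reduced))
    return scores


def weak_token_labels(tokens: List[str],
                      predict_offensive: Callable[[List[str]], float],
                      threshold: float = 0.2) -> List[int]:
    """Mark a token as offensive (1) when its score reaches the assumed threshold."""
    scores = occlusion_scores(tokens, predict_offensive)
    return [1 if s >= threshold else 0 for s in scores]


# Hypothetical stand-in for a trained sentence-level model.
TOY_LEXICON = {"badword"}


def toy_classifier(tokens: List[str]) -> float:
    """Return a made-up offensive probability based on a one-word lexicon."""
    return 0.9 if any(t in TOY_LEXICON for t in tokens) else 0.1


print(weak_token_labels(["you", "are", "badword"], toy_classifier))
```

The token labels produced this way are only as good as the sentence-level model and the chosen threshold, which is one plausible reason such methods trail models trained directly on token annotations.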
The authors highlight the importance of the SOLD dataset, as it is the first large publicly available offensive language dataset compiled for the Sinhala language. They also introduce SemiSOLD, a larger semi-supervised dataset with more than 145,000 Sinhala tweets, which can be used to further improve the performance of offensive language identification models.
Overall, the paper contributes to offensive language detection in low-resource languages by providing an annotated Sinhala dataset with sentence-level and token-level labels, along with a comparison of machine learning approaches on it.
Source: arxiv.org