
Sinhala Offensive Language Dataset (SOLD): Annotated Dataset for Detecting Offensive Content in Sinhala


Core Concepts
This paper introduces the Sinhala Offensive Language Dataset (SOLD), the largest annotated dataset for detecting offensive content in the Sinhala language. The dataset contains 10,000 tweets annotated at both the sentence and token level, enabling the development of explainable models for offensive language identification.
Abstract
The paper presents the Sinhala Offensive Language Dataset (SOLD), which is the largest annotated dataset for detecting offensive content in the Sinhala language. The dataset contains 10,000 tweets annotated at both the sentence and token level, allowing for the development of explainable models.
The authors first describe the data collection and annotation process. They collected 10,500 tweets using predefined keywords and filtered for Sinhala script. The tweets were then annotated by a team of 10 native Sinhala speakers, who labeled each tweet as offensive or not offensive at the sentence level. If a tweet was labeled as offensive, the annotators also highlighted the specific tokens that contributed to the offensiveness.
The authors then conduct experiments using various machine learning models, including support vector machines, BiLSTMs, CNNs, and transformer-based models, to perform sentence-level and token-level offensive language identification on the SOLD dataset. The results show that transformer models, particularly XLM-R and XLM-T, outperform the other approaches, achieving macro F1 scores of 0.83 and 0.82 respectively for sentence-level classification.
For token-level classification, the authors also explore weakly supervised learning using LIME, a technique that can identify offensive tokens without requiring token-level annotations. While the transformer-based models perform better than the BiLSTM approach, the weakly supervised LIME-based methods do not achieve the same level of performance.
The authors highlight the importance of the SOLD dataset, as it is the first large publicly available offensive language dataset compiled for the Sinhala language. They also introduce SemiSOLD, a larger semi-supervised dataset with more than 145,000 Sinhala tweets, which can be used to further improve the performance of offensive language identification models.
Overall, the paper makes significant contributions to the field of offensive language detection in low-resource languages, providing a valuable dataset and insights into the performance of various machine learning approaches for Sinhala.
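The sentence-level results above are reported as macro F1, which averages the per-class F1 scores without weighting by class frequency. The following is a minimal sketch of that metric in plain Python; the label strings `OFF` and `NOT` and the toy predictions are assumptions for illustration, not values taken from the paper.

```python
def macro_f1(gold, pred, labels=("OFF", "NOT")):
    """Return the unweighted mean of per-class F1 scores.

    The label names are hypothetical placeholders for
    "offensive" and "not offensive".
    """
    scores = []
    for label in labels:
        pairs = list(zip(gold, pred))
        tp = sum(1 for g, p in pairs if g == label and p == label)
        fp = sum(1 for g, p in pairs if g != label and p == label)
        fn = sum(1 for g, p in pairs if g == label and p != label)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        if precision + recall:
            f1 = 2 * precision * recall / (precision + recall)
        else:
            f1 = 0.0
        scores.append(f1)
    return sum(scores) / len(scores)


# Toy example: five tweets with made-up gold labels and predictions.
gold = ["OFF", "OFF", "NOT", "NOT", "NOT"]
pred = ["OFF", "NOT", "NOT", "NOT", "OFF"]
score = macro_f1(gold, pred)
print(round(score, 4))
```

Because each class contributes equally, macro F1 does not let the larger class dominate the score, which matters when the offensive and non-offensive classes are unevenly sized.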
Stats
The SOLD dataset contains 10,000 tweets annotated for offensive and non-offensive content. 41% of the tweets in the dataset were labeled as offensive. The average tweet length falls within the range of 0 to 20 tokens.
Quotes
"SOLD is the first large publicly available offensive language dataset compiled for Sinhala."
"We explore offensive language identification with cross-lingual embeddings and transfer learning. We take advantage of existing data in high-resource languages such as English to project predictions to Sinhala."
"We investigate semi-supervised data augmentation. We create SemiSOLD; a larger semi-supervised dataset with more than 145,000 instances for Sinhala."

Key Insights Distilled From

by Tharindu Ran... at arxiv.org 03-29-2024

https://arxiv.org/pdf/2212.00851.pdf
SOLD

Deeper Inquiries

How can the findings from the SOLD dataset be applied to other low-resource languages to improve offensive language detection?

The findings from the SOLD dataset can be applied to other low-resource languages through transfer learning. Many low-resource languages lack sufficient annotated data for offensive language identification, so models trained on data from higher-resource languages can be fine-tuned or adapted for these settings. Pre-trained multilingual transformer models such as XLM-R and XLM-T, which performed best on the SOLD dataset for Sinhala, give researchers a way to carry knowledge from one language to another. This approach can help bootstrap offensive language detection systems in languages with limited resources and improve their accuracy.

What are the potential challenges and limitations of using weakly supervised techniques like LIME for token-level offensive language identification in low-resource settings?

While weakly supervised techniques like LIME offer a valuable approach for token-level offensive language identification in low-resource settings, they come with certain challenges and limitations. One major challenge is the reliance on sentence-level labels for training the model, which may not capture the nuances and complexities of offensive language at the token level. This can lead to suboptimal performance in identifying offensive tokens accurately. Additionally, the interpretability and explainability of the model may be limited when using weakly supervised techniques, as the model's decisions are based on indirect signals rather than direct token-level annotations. Furthermore, the effectiveness of weakly supervised methods like LIME heavily depends on the quality and representativeness of the training data, which can be a limitation in low-resource settings where annotated data is scarce.
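The core idea of deriving token-level signals from a sentence-level model can be illustrated with a short sketch. The code below uses leave-one-out occlusion, a simpler relative of LIME rather than LIME itself (LIME fits a local surrogate model over many random perturbations). The scoring function, the lexicon, the placeholder tokens, and the threshold are all hypothetical stand-ins, not the paper's models or data.

```python
def toy_offense_score(tokens):
    """Hypothetical stand-in for a sentence-level classifier.

    Returns a score in [0, 1]; a real system would call a trained model.
    """
    lexicon = {"insult_a": 0.6, "insult_b": 0.3}  # made-up weights
    return min(1.0, sum(lexicon.get(t, 0.0) for t in tokens))


def occlusion_attribution(tokens, score_fn):
    """Score each token by how much the sentence score drops without it."""
    base = score_fn(tokens)
    attributions = []
    for i in range(len(tokens)):
        reduced = tokens[:i] + tokens[i + 1:]
        attributions.append(base - score_fn(reduced))
    return attributions


def offensive_tokens(tokens, score_fn, threshold=0.2):
    """Return tokens whose removal lowers the score by at least threshold."""
    attrs = occlusion_attribution(tokens, score_fn)
    return [t for t, a in zip(tokens, attrs) if a >= threshold]


tokens = ["you", "are", "insult_a", "and", "insult_b"]
attrs = occlusion_attribution(tokens, toy_offense_score)
flagged = offensive_tokens(tokens, toy_offense_score)
print(flagged)
```

The sketch also shows where the limitation comes from: the token attributions are only as good as the sentence-level scorer, since no token-level supervision is ever used.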

How can the SOLD dataset and the insights from this study be leveraged to develop real-world content moderation systems for social media platforms in Sri Lanka?

The SOLD dataset and the insights gained from this study can be instrumental in developing real-world content moderation systems for social media platforms in Sri Lanka. By training machine learning models on the SOLD dataset, social media platforms can implement more effective and accurate offensive language detection systems tailored to the Sinhala language. These models can help automate the process of flagging and removing offensive content, thereby improving the overall user experience and safety on social media platforms. Additionally, the token-level annotations in the SOLD dataset can enhance the explainability of the models, enabling content moderators to understand why certain content is flagged as offensive. This transparency can aid in making more informed decisions when moderating content and addressing offensive language effectively. Ultimately, leveraging the SOLD dataset can contribute to creating a safer and more inclusive online environment for users in Sri Lanka.
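As a rough illustration of how token-level output could support a moderator, the sketch below turns a sentence label plus per-token labels into a reviewable record. The `ModerationDecision` type, the `explain_flag` helper, the `OFF` label string, and the bracket highlighting are hypothetical choices made for this example, not part of SOLD or of any platform's tooling.

```python
from dataclasses import dataclass


@dataclass
class ModerationDecision:
    flagged: bool
    highlighted_text: str


def explain_flag(tokens, token_labels, sentence_label):
    """Build a moderator-facing record from sentence and token labels.

    token_labels holds 1 for tokens marked as contributing to the
    offense and 0 otherwise (an assumed encoding for this sketch).
    """
    if len(tokens) != len(token_labels):
        raise ValueError("tokens and token_labels must have equal length")
    if sentence_label != "OFF":
        # Non-offensive content is passed through without highlights.
        return ModerationDecision(False, " ".join(tokens))
    marked = [
        f"[{tok}]" if lab == 1 else tok
        for tok, lab in zip(tokens, token_labels)
    ]
    return ModerationDecision(True, " ".join(marked))


decision = explain_flag(["you", "are", "insult_a"], [0, 0, 1], "OFF")
print(decision.highlighted_text)
```

Keeping the highlighted tokens alongside the decision lets a human reviewer check the reason for a flag instead of trusting the sentence-level label alone.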