Alam, S., Ishmam, M. F., Alvee, N. H., Siddique, M. S., Hossain, M. A., & Kamal, A. R. M. (2024). BnSentMix: A Diverse Bengali-English Code-Mixed Dataset for Sentiment Analysis. arXiv preprint arXiv:2408.08964v2.
This paper addresses the lack of large-scale, diverse datasets for sentiment analysis in code-mixed Bengali-English by introducing BnSentMix, a new dataset designed to overcome the limitations of existing resources in this domain.
The researchers collected over 3 million user-generated content samples from YouTube, Facebook, and e-commerce platforms. They developed a novel automated text filtering pipeline using fine-tuned language models to identify code-mixed Bengali-English text. The dataset was annotated with four sentiment labels: positive, negative, neutral, and mixed. Eleven baseline models, including classical machine learning, recurrent neural network variants, and transformer-based pre-trained language models, were evaluated on the dataset.
The researchers curated a dataset of 20,000 samples with four sentiment labels, achieving substantial inter-annotator agreement (Cohen’s Kappa κ = 0.86). Their automated code-mixed text detection pipeline achieved an accuracy of 94.56%. Among the evaluated baselines, BERT achieved the highest performance with 69.5% accuracy and 68.8% F1 score.
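To make the agreement figure concrete, the following is a minimal sketch of how Cohen's kappa can be computed for two annotators, using κ = (p_o − p_e) / (1 − p_e), where p_o is the observed agreement and p_e is the agreement expected by chance. The function name `cohen_kappa` and the toy label lists are illustrative assumptions; they are not taken from the paper or from the BnSentMix annotations.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)

    # Observed agreement: fraction of items where both annotators agree.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # Chance agreement: for each label, the product of the two annotators'
    # marginal probabilities of choosing it, summed over all labels.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    all_labels = freq_a.keys() | freq_b.keys()
    p_e = sum(freq_a[c] * freq_b[c] for c in all_labels) / (n * n)

    if p_e == 1.0:
        # Both annotators used a single identical label for everything.
        return 1.0
    return (p_o - p_e) / (1 - p_e)


# Hypothetical toy annotations (NOT from BnSentMix).
annotator_a = ["positive", "negative", "neutral", "mixed",
               "positive", "negative", "neutral", "positive"]
annotator_b = ["positive", "negative", "neutral", "positive",
               "positive", "negative", "mixed", "positive"]

kappa = cohen_kappa(annotator_a, annotator_b)
print(round(kappa, 3))  # 0.644
```

In this toy example the annotators agree on 6 of 8 items (p_o = 0.75), but because some of that agreement would occur by chance (p_e ≈ 0.297), the kappa is lower than the raw agreement rate.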
The creation and public availability of BnSentMix are a critical step towards developing inclusive NLP tools for code-mixed languages. The baseline results, with BERT reaching 69.5% accuracy and 68.8% F1 score, demonstrate the dataset's potential for advancing sentiment analysis research in code-mixed Bengali-English.
This research significantly contributes to the field of NLP by providing a valuable resource for a previously under-resourced language pair. The development of effective sentiment analysis tools for code-mixed Bengali-English has implications for various applications, including social media monitoring, customer feedback analysis, and market research.
The label distribution in BnSentMix is slightly imbalanced, with a lower proportion of samples labeled as "mixed" sentiment. Future research could explore techniques to address this imbalance and further improve model performance on this specific sentiment category. Additionally, investigating the impact of annotator bias and exploring methods to mitigate it could enhance the dataset's quality and generalizability.
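One common way to address this kind of imbalance is to weight each class inversely to its frequency when computing the training loss. The sketch below illustrates that idea only; it is not a technique reported in the paper, the function name `inverse_frequency_weights` is hypothetical, and the label counts are invented for illustration rather than taken from BnSentMix.

```python
from collections import Counter


def inverse_frequency_weights(labels):
    """Return a weight per class: n_samples / (n_classes * class_count).

    Rare classes get weights above 1, frequent classes below 1, and the
    weighted sample count still sums to the original number of samples.
    """
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    return {label: n_samples / (n_classes * count)
            for label, count in counts.items()}


# Hypothetical label distribution (NOT the real BnSentMix counts),
# with "mixed" deliberately under-represented.
labels = (["positive"] * 6 + ["negative"] * 6
          + ["neutral"] * 5 + ["mixed"] * 3)

weights = inverse_frequency_weights(labels)
for label, weight in sorted(weights.items()):
    print(f"{label}: {weight:.3f}")
```

Weights of this form can typically be passed to a weighted loss function so that errors on the rarer "mixed" class count for more during training. Whether this improves performance on BnSentMix would need to be checked empirically.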
Key Insights Distilled From
by Sadia Alam, ... at arxiv.org 10-22-2024
https://arxiv.org/pdf/2408.08964.pdf