MaCmS: Magahi Code-mixed Dataset for Sentiment Analysis
Core Concepts
The authors introduce MaCmS, the first Magahi-Hindi-English code-mixed dataset for sentiment analysis, with the aim of understanding language preferences and emotions in code-mixing.
Summary
The paper introduces MaCmS, a dataset for sentiment analysis of code-mixed text. It discusses the challenges of sentiment analysis in low-resourced languages and presents linguistic and statistical analyses of the dataset. Baseline models are evaluated, and their performance varies from model to model.
Statistics
The dataset includes 5000 sentences and 750 span sentences for sentiment analysis.
Inter-annotator agreement scores were 0.78 for sentence-level annotation and 0.76 for span-level annotation.
The XLM-R model achieved the highest F1 score, 0.75, for sentence-level sentiment analysis.
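For readers unfamiliar with the two kinds of figures reported above, the following is a minimal sketch of how an agreement score and an F1 score can be computed. It assumes Cohen's kappa for two annotators and macro-averaged F1; this summary does not say which agreement metric or averaging scheme the authors used. The function names and the toy labels are hypothetical and are not taken from the dataset.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators (assumed metric, see lead-in)."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: share of items on which both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement, from each annotator's own label distribution.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    all_labels = set(labels_a) | set(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in all_labels) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


def macro_f1(gold, pred):
    """Unweighted mean of per-class F1 (assumed averaging scheme)."""
    scores = []
    for label in sorted(set(gold) | set(pred)):
        tp = sum(g == label and p == label for g, p in zip(gold, pred))
        fp = sum(g != label and p == label for g, p in zip(gold, pred))
        fn = sum(g == label and p != label for g, p in zip(gold, pred))
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)


# Toy labels for illustration only; not drawn from MaCmS.
annotator_1 = ["pos", "neg", "neu", "pos"]
annotator_2 = ["pos", "neg", "pos", "pos"]
print(round(cohen_kappa(annotator_1, annotator_2), 3))
print(round(macro_f1(annotator_1, annotator_2), 3))
```

Kappa corrects raw agreement for the agreement expected by chance, which is why it is usually preferred over simple percent agreement when the label distribution is skewed.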
Quotes
"We also provide some baseline models for sentiment analysis at the sentence and language-specific span levels."
"Sentiment analysis is commonly regarded as a task involving categorizing text into one of three categories: positive, negative, or neutral."
"The results do not agree with previous studies which state that speakers prefer the first language to express negative sentiments."
Deeper Inquiries
How can the dataset be expanded to include more diverse content from various sources?
To expand the dataset and include more diverse content, researchers can consider scraping data from additional social media platforms beyond YouTube. Platforms like Twitter, Facebook, Instagram, or regional forums could provide a wider range of language use and sentiment expressions. Collaborating with local communities, online forums, or news outlets in Magahi-speaking regions could also offer authentic and varied linguistic samples. Furthermore, incorporating user-generated content from blogs, websites, or other online platforms where code-mixing is prevalent would enrich the dataset's diversity.
What implications does this research have on understanding cultural attitudes through language preferences?
This research offers valuable insights into how cultural attitudes are expressed through language preferences in code-mixed contexts. By analyzing sentiment patterns across languages like Magahi-Hindi-English, researchers can decipher speakers' emotional states and their affiliations towards different cultures or traditions. Understanding these language choices provides a window into individuals' identities within multilingual societies and sheds light on how sentiments are intertwined with cultural nuances. This knowledge can aid in sociolinguistic studies focusing on identity formation, community dynamics, and intercultural communication.
How can these findings be applied to improve sentiment analysis in other low-resourced languages?
The findings from this study can serve as a blueprint for improving sentiment analysis in other low-resourced languages that face similar challenges of code-mixing and limited resources. Researchers can adapt the methodology used to create the MaCmS dataset to build annotated sentiment datasets for other linguistic contexts. Deep learning models pretrained on multilingual data, such as mBERT or XLM-R, could help capture the complex linguistic features of code-mixed text across languages. In addition, insights into language preferences and sentiment expression could inform the design of sentiment analysis tools tailored to specific low-resourced languages, taking their linguistic characteristics and cultural influences into account.