
Comprehensive Kazakh Sentiment Analysis Dataset: KazSAnDRA

Core Concepts
This study presents KazSAnDRA, the first and largest publicly available dataset for Kazakh sentiment analysis, comprising 180,064 reviews with numerical ratings from 1 to 5 that quantitatively represent customer attitudes. The study also developed and evaluated four machine learning models for both polarity and score classification, with the most successful model achieving an F1-score of 0.81 for polarity classification and 0.39 for score classification on the test sets.
The study presents the development of KazSAnDRA, a comprehensive dataset for Kazakh sentiment analysis. The dataset was collected from four domains: digital mapping and navigation services, online marketplaces, an online library, and an online store for Android devices. The dataset comprises a total of 180,064 reviews, with each review accompanied by a numerical rating from 1 to 5 to represent the customer's attitude. The study highlights the variations in Kazakh reviews, which can include a combination of Cyrillic and Latin characters, a mixture of Kazakh and Russian vocabulary, or solely Cyrillic script with Russian characters substituting Kazakh ones.
To evaluate the effectiveness of KazSAnDRA, the study utilized the dataset for two tasks: polarity classification (PC), which involves predicting whether a review is positive or negative, and score classification (SC), which involves predicting the score of a review on a scale of 1 to 5.
The data pre-processing stage involved several essential steps, such as removing emojis, lowercasing the text, eliminating punctuation, and handling consecutive characters. The dataset was then divided into training, validation, and test sets, maintaining a ratio of 80/10/10. To address the data imbalance, the study employed random oversampling (ROS) and random undersampling (RUS) techniques.
Four multilingual machine learning models were utilized for the sentiment classification tasks: mBERT, XLM-R, RemBERT, and mBART-50. The experimental analysis involved evaluating the models on both balanced and imbalanced training data. The most successful model, RemBERT, achieved an F1-score of 0.81 for the PC task and 0.39 for the SC task on the test sets.
The study discusses several challenges and potential improvements for the dataset, such as addressing spelling errors, effectively handling code-switching, and applying lemmatization techniques. Additionally, the authors suggest the need for formulating sentiment annotation guidelines in the Kazakh language to address the subjectivity in rating assignments by individual authors. The dataset and fine-tuned models are available for unrestricted access and can be freely downloaded from the authors' GitHub repository under the Creative Commons Attribution 4.0 International License (CC BY 4.0).
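The pre-processing steps and the 80/10/10 split mentioned in the summary can be illustrated with a minimal Python sketch. This is not the authors' pipeline. The emoji ranges, the extra punctuation marks, the rule of cutting runs of three or more identical characters down to two, and the plain (non-stratified) shuffle are all assumptions made for illustration, and the names preprocess and split_80_10_10 are hypothetical.

```python
import random
import re
import string

# Assumption: a rough matcher for common Unicode emoji blocks; the exact
# emoji handling used for KazSAnDRA is not described in the summary.
EMOJI_RE = re.compile(
    "[\U0001F300-\U0001FAFF\U00002600-\U000027BF\U0001F1E6-\U0001F1FF\uFE0F\u200D]"
)
# ASCII punctuation plus a few typographic marks (guillemets, ellipsis,
# dashes, curly quotes) that often appear in Cyrillic text.
EXTRA_PUNCT = "\u00ab\u00bb\u2026\u2013\u2014\u201c\u201d\u201e"
PUNCT_RE = re.compile("[" + re.escape(string.punctuation + EXTRA_PUNCT) + "]")
# Assumption: "handling consecutive characters" is modelled here as cutting
# runs of three or more identical characters down to two.
REPEAT_RE = re.compile(r"(.)\1{2,}")


def preprocess(text: str) -> str:
    """Remove emojis, lowercase, strip punctuation, shorten repeated characters."""
    text = EMOJI_RE.sub(" ", text)
    text = text.lower()
    text = PUNCT_RE.sub(" ", text)
    text = REPEAT_RE.sub(r"\1\1", text)
    return " ".join(text.split())  # collapse leftover whitespace


def split_80_10_10(items, seed=0):
    """Shuffle and split into train/validation/test parts at an 80/10/10 ratio."""
    items = list(items)
    random.Random(seed).shuffle(items)
    n_train = len(items) * 8 // 10
    n_val = len(items) // 10
    train = items[:n_train]
    val = items[n_train:n_train + n_val]
    test = items[n_train + n_val:]
    return train, val, test
```

Calling preprocess on each review and then split_80_10_10 on the cleaned list would give three disjoint subsets; a stratified split by rating might be preferable given the skewed rating distribution, but the summary does not say which variant was used.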
The dataset comprises a total of 180,064 reviews. The reviews are distributed across four domains: Appstore (135,073), Market (30,289), Mapping (8,897), and Bookstore (5,805). The reviews have numerical ratings ranging from 1 to 5, with the following distribution: 1 (25,235), 2 (4,929), 3 (7,262), 4 (11,617), and 5 (131,021).
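The rating counts above are heavily skewed: ratings of 5 account for roughly 73% of all reviews. The sketch below computes the class shares from those counts and shows one simple way random oversampling (ROS) and random undersampling (RUS) could be implemented. It is an illustrative sketch, not the authors' code; the function names and the choice to balance every class to the largest (ROS) or smallest (RUS) class size are assumptions.

```python
import random

# Rating counts as reported in the summary.
RATING_COUNTS = {1: 25_235, 2: 4_929, 3: 7_262, 4: 11_617, 5: 131_021}


def class_shares(counts):
    """Return the fraction of the dataset that falls into each class."""
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


def _group_by_label(samples, labels):
    groups = {}
    for sample, label in zip(samples, labels):
        groups.setdefault(label, []).append(sample)
    return groups


def random_oversample(samples, labels, seed=0):
    """Duplicate random minority samples until every class matches the largest."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = max(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        extra = [rng.choice(group) for _ in range(target - len(group))]
        out_samples.extend(group + extra)
        out_labels.extend([label] * target)
    return out_samples, out_labels


def random_undersample(samples, labels, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    groups = _group_by_label(samples, labels)
    target = min(len(group) for group in groups.values())
    out_samples, out_labels = [], []
    for label, group in groups.items():
        out_samples.extend(rng.sample(group, target))
        out_labels.extend([label] * target)
    return out_samples, out_labels
```

Resampling of this kind is normally applied to the training split only, so that validation and test sets keep the original distribution.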
"The dataset we present includes reviews containing both exclusive Kazakh vocabulary and words from other languages (Russian, English, and Arabic), making it the largest dataset available for Kazakh sentiment analysis." "The highest F1-score on the test sets was 0.81 for polarity classification and 0.39 for score classification."

Key Insights Distilled From

by Rustem Yeshp... at 03-29-2024

Deeper Inquiries

What other techniques, such as data augmentation or linguistic analysis, could be explored to further improve the performance of sentiment classification models on the KazSAnDRA dataset?

To further enhance the performance of sentiment classification models on the KazSAnDRA dataset, several techniques can be explored:
Data Augmentation: Implementing data augmentation techniques like back-translation can help increase the diversity and quantity of the dataset. By translating reviews into another language and then back to Kazakh, new variations of the reviews can be generated, enriching the dataset and improving model generalization.
Linguistic Analysis: Conducting a more in-depth linguistic analysis of the reviews can help in addressing code-switching issues between Kazakh and Russian, as well as identifying and correcting spelling errors. This analysis can also involve lemmatization to normalize words and improve the overall quality of the text data.
Fine-tuning Pre-trained Models: Fine-tuning pre-trained language models specifically for the Kazakh language can lead to better performance on sentiment analysis tasks. Models like mBERT and XLM-R can be further fine-tuned on Kazakh-specific sentiment analysis data to capture language nuances effectively.
Domain-specific Feature Engineering: Incorporating domain-specific features or embeddings related to the different review sources (Mapping, Market, Bookstore, Appstore) can help the models better understand the context and sentiment expressed in reviews from each domain.
Ensemble Learning: Implementing ensemble learning techniques by combining predictions from multiple models can potentially improve the overall performance and robustness of sentiment classification models on the dataset.
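As one concrete reading of the ensemble idea above, the following sketch averages the class probabilities produced by several models for a single review (often called soft voting). It assumes each model already outputs a probability per class in the same class order; the function name soft_vote and the example numbers are hypothetical and not taken from the study.

```python
def soft_vote(prob_lists):
    """Average per-class probabilities across models.

    prob_lists: one list of class probabilities per model, all in the
    same class order. Returns (index of winning class, averaged probabilities).
    """
    if not prob_lists:
        raise ValueError("need predictions from at least one model")
    n_classes = len(prob_lists[0])
    if any(len(probs) != n_classes for probs in prob_lists):
        raise ValueError("all models must predict the same number of classes")
    n_models = len(prob_lists)
    averaged = [
        sum(probs[c] for probs in prob_lists) / n_models
        for c in range(n_classes)
    ]
    best = max(range(n_classes), key=averaged.__getitem__)
    return best, averaged


# Hypothetical polarity probabilities [negative, positive] from three models.
label, averaged = soft_vote([[0.6, 0.4], [0.3, 0.7], [0.2, 0.8]])
```

Averaging probabilities lets a confident model outweigh two uncertain ones, whereas majority voting over hard labels would count each model equally; which works better here would have to be checked on the validation set.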

How could the subjectivity in rating assignments by individual authors be addressed to enhance the reliability of the dataset for sentiment analysis tasks?

To address the subjectivity in rating assignments by individual authors and enhance the reliability of the dataset for sentiment analysis tasks, the following strategies can be considered:
Standardized Annotation Guidelines: Developing clear and standardized guidelines for sentiment annotation can help ensure consistency in rating assignments across different authors. These guidelines should define criteria for assigning positive, negative, or neutral sentiments to reviews, reducing subjectivity.
Inter-Annotator Agreement: Implementing inter-annotator agreement measures where multiple annotators independently label the same reviews can help identify discrepancies in ratings. Resolving disagreements through discussion and establishing consensus can lead to more reliable annotations.
Quality Control Checks: Introducing quality control checks during the review collection process can help filter out reviews with inappropriate content, spelling errors, or unclear sentiments. Only reviews meeting predefined quality standards should be included in the dataset.
Reviewer Training: Providing training to reviewers on sentiment analysis principles, language nuances, and rating criteria can improve the consistency and accuracy of their rating assignments. Reviewers should be familiar with the guidelines and best practices for sentiment annotation.
Regular Review Audits: Periodic audits of the dataset by domain experts or linguists can help identify and rectify any inconsistencies or errors in the sentiment annotations. Audits can ensure the dataset maintains high quality and reliability for sentiment analysis tasks.
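Inter-annotator agreement is commonly quantified with Cohen's kappa, which corrects the raw agreement between two annotators for the agreement expected by chance. The sketch below is a plain-Python illustration of that standard formula; the summary does not state that the authors used this measure, and the function name cohen_kappa is chosen here for illustration.

```python
from collections import Counter


def cohen_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labelled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A value near 1 indicates strong agreement, while a value near 0 means the annotators agree no more often than chance would predict, which would signal that the annotation guidelines need tightening.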

What potential applications and use cases could the KazSAnDRA dataset enable beyond sentiment analysis, such as cross-lingual studies or language modeling?

The KazSAnDRA dataset can enable various applications and use cases beyond sentiment analysis, including:
Cross-Lingual Studies: The multilingual nature of the dataset, with reviews in both Kazakh and other languages like Russian, English, and Arabic, can facilitate cross-lingual studies. Researchers can explore language transfer learning, sentiment analysis across different languages, and language adaptation techniques using the diverse reviews in the dataset.
Language Modeling: The dataset can be utilized for training and evaluating language models specific to the Kazakh language. By leveraging the textual data in KazSAnDRA, researchers can develop language models that capture the linguistic characteristics, sentiment expressions, and vocabulary of Kazakh reviews, contributing to advancements in natural language processing for Kazakh.
Product and Service Analysis: Beyond sentiment analysis, the dataset can be used for product and service analysis in various domains like digital mapping, online marketplaces, bookstores, and app stores. Researchers can extract insights on customer preferences, satisfaction levels, and trends in different sectors based on the reviews collected, aiding businesses in improving their offerings and customer experiences.
Sentiment Trend Analysis: By analyzing the sentiment trends over time within the dataset, researchers can track changes in customer attitudes, identify emerging patterns, and predict future sentiment shifts. This can be valuable for businesses to adapt their strategies, address customer concerns, and enhance brand reputation based on sentiment analysis insights.