Core Concepts
Large language models can perform sentiment analysis on Bangla text through zero- and few-shot prompting, clearing the random and majority baselines, though fine-tuned models still outperform them.
Summary
This study presents a comprehensive evaluation of zero- and few-shot prompting with large language models (LLMs) for Bangla sentiment analysis. The authors developed a new dataset called MUBASE, which contains 33,606 manually annotated Bangla news tweets and Facebook comments.
The key highlights and insights from the study are:
- The authors compared classical models, fine-tuned models, and LLMs (Flan-T5, GPT-4, BLOOMZ) in both zero-shot and few-shot settings (a prompt-construction sketch follows this list).
- Fine-tuned models, particularly the monolingual BanglaBERT, consistently outperformed the LLMs across various metrics.
- While the LLMs surpassed the random and majority baselines, they fell short compared to the fine-tuned models.
- The smaller BLOOMZ model (560M parameters) performed better than the larger one (1.7B), suggesting that more training data is needed to train larger models effectively.
- The authors observed little to no performance difference between zero- and few-shot learning with the GPT-4 model, while BLOOMZ yielded better performance in the majority of zero- and few-shot experiments.
- The authors also explored the impact of different prompting strategies, finding that native language instructions achieved comparable performance to English instructions for Bangla sentiment analysis.
- The authors conducted an error analysis, revealing that Flan-T5 struggled to predict the negative class, BLOOMZ failed to label posts as neutral, and GPT-4 had difficulty with the positive class.
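To make the zero- and few-shot setups concrete, here is a minimal Python sketch of how such prompts can be assembled for Bangla sentiment classification, including a native-language instruction variant. The instruction wording, label names, and Bangla examples are illustrative assumptions, not the paper's exact templates.

```python
# Minimal sketch of zero- and few-shot prompt construction for Bangla
# sentiment classification. Instruction wording, label names, and the
# demonstration examples are illustrative assumptions.

LABELS = ["Positive", "Neutral", "Negative"]

EN_INSTRUCTION = (
    "Classify the sentiment of the following Bangla text as "
    "Positive, Neutral, or Negative. Answer with the label only."
)

# Native-language (Bangla) instruction variant, reflecting the finding that
# Bangla instructions perform comparably to English ones.
BN_INSTRUCTION = (
    "নিচের বাংলা লেখাটির অনুভূতি Positive, Neutral, বা Negative "
    "হিসেবে শ্রেণীবদ্ধ করুন। শুধুমাত্র লেবেলটি লিখুন।"
)

def zero_shot_prompt(text: str, instruction: str = EN_INSTRUCTION) -> str:
    """Instruction plus the input only; no labelled demonstrations."""
    return f"{instruction}\n\nText: {text}\nLabel:"

def few_shot_prompt(text: str, examples: list[tuple[str, str]],
                    instruction: str = EN_INSTRUCTION) -> str:
    """Instruction followed by k labelled (text, label) demonstrations."""
    demos = "\n\n".join(f"Text: {t}\nLabel: {l}" for t, l in examples)
    return f"{instruction}\n\n{demos}\n\nText: {text}\nLabel:"

if __name__ == "__main__":
    demos = [
        ("খেলাটা দারুণ ছিল!", "Positive"),            # "The game was great!"
        ("রাস্তার অবস্থা খুবই খারাপ।", "Negative"),   # "The road is in very bad shape."
    ]
    print(zero_shot_prompt("আজকের খবরটা ভালো লাগলো।"))
    print(few_shot_prompt("আজকের খবরটা ভালো লাগলো।", demos))
```

The resulting prompt strings can then be sent to any of the evaluated LLMs; only the model call differs between systems.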
Overall, the study provides valuable insights into the effectiveness of LLMs for Bangla sentiment analysis and highlights the continued need for fine-tuned models, especially for low-resource languages.
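For contrast with the prompted LLMs, the sketch below shows one way the BanglaBERT baseline could be fine-tuned for three-way sentiment classification with Hugging Face Transformers. The checkpoint id (csebuetnlp/banglabert), the CSV file layout, and the hyperparameters are assumptions for illustration, not the authors' exact configuration.

```python
# Minimal sketch of fine-tuning BanglaBERT for 3-way sentiment classification.
# Checkpoint id, file layout, and hyperparameters are assumptions.
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

MODEL_ID = "csebuetnlp/banglabert"   # assumed public BanglaBERT checkpoint
LABELS = {"positive": 0, "neutral": 1, "negative": 2}

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForSequenceClassification.from_pretrained(
    MODEL_ID, num_labels=len(LABELS))

# Hypothetical CSV files with "text" and "label" columns for the
# train/dev splits of a MUBASE-style dataset.
data = load_dataset("csv", data_files={"train": "train.csv", "dev": "dev.csv"})

def preprocess(batch):
    # Tokenize the text and map string labels to integer ids.
    enc = tokenizer(batch["text"], truncation=True, max_length=128)
    enc["labels"] = [LABELS[label] for label in batch["label"]]
    return enc

data = data.map(preprocess, batched=True, remove_columns=["text", "label"])

args = TrainingArguments(
    output_dir="banglabert-sentiment",
    learning_rate=2e-5,                # assumed; typical for BERT-style fine-tuning
    per_device_train_batch_size=16,
    num_train_epochs=3,
)

trainer = Trainer(model=model, args=args,
                  train_dataset=data["train"], eval_dataset=data["dev"],
                  tokenizer=tokenizer)   # default collator pads batches dynamically
trainer.train()
```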
Statistics
The dataset contains 33,606 Bangla news tweets and Facebook comments.
The dataset is divided into 23,472 training, 3,427 development, and 6,707 test instances.
The class distribution is: 10,560 positive, 6,197 neutral, and 16,849 negative instances.
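Given these class counts, the random and majority baselines mentioned in the summary can be sanity-checked with a short calculation; treating the overall distribution as representative of the test split is an assumption, not a reported figure.

```python
# Back-of-the-envelope baselines implied by the reported class counts.
# Assumes the test split mirrors the overall class distribution (an
# assumption; the paper reports its baseline scores separately).
counts = {"positive": 10_560, "neutral": 6_197, "negative": 16_849}
total = sum(counts.values())                 # 33,606 instances

majority_acc = max(counts.values()) / total  # always predict "negative"
random_acc = 1 / len(counts)                 # uniform random guess over 3 classes

print(f"majority baseline accuracy ~ {majority_acc:.1%}")  # ~ 50.1%
print(f"random baseline accuracy  ~ {random_acc:.1%}")     # ~ 33.3%
```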
Quotes
"Fine-tuned models, particularly the monolingual BanglaBERT, consistently outperformed the LLMs across various metrics."
"While the LLMs surpassed the random and majority baselines, they fell short compared to the fine-tuned models."
"The performance of the smaller BLOOMZ model (560m) was better than the larger one (1.7B), suggesting the need for more training data to effectively train large models."