Core Concepts
LLMs offer a versatile, general-purpose approach to targeted sentiment analysis (TSA), but prompt design significantly influences their performance.
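To make the prompt-design axis concrete, here is a minimal Python sketch contrasting zero-shot and few-shot TSA prompts at two levels of prescriptiveness. The templates, the headline, and the demonstration labels are illustrative assumptions, not the study's actual prompts:

```python
# Sketch: zero-shot vs. few-shot TSA prompts at two prescriptiveness levels.
# All templates and examples below are hypothetical, not taken from the paper.

TARGET = "the mayor"
HEADLINE = "Mayor's new budget slashes funding for public libraries"

# Low-prescriptiveness (descriptive) zero-shot prompt: the model decides
# what "sentiment toward the target" means.
zero_shot = (
    f"What is the sentiment of this news headline toward {TARGET}? "
    "Answer with one word: positive, negative, or neutral.\n"
    f"Headline: {HEADLINE}"
)

# Higher-prescriptiveness variant: explicit annotation rules constrain the task.
zero_shot_prescriptive = (
    f"Classify the sentiment expressed toward {TARGET} in the headline below. "
    "Judge only how the target is portrayed by the author, not the overall "
    "tone of the event. Answer with exactly one word: positive, negative, "
    "or neutral.\n"
    f"Headline: {HEADLINE}"
)

# Few-shot prompt: prepend labeled demonstrations (labels here are made up).
FEW_SHOT_EXAMPLES = [
    ("Local teacher wins national award for literacy program",
     "the teacher", "positive"),
    ("Senator denies involvement in lobbying scandal",
     "the senator", "negative"),
]

demos = "\n".join(
    f"Headline: {h}\nTarget: {t}\nSentiment: {s}"
    for h, t, s in FEW_SHOT_EXAMPLES
)
few_shot = (
    "Classify the sentiment toward the target in each headline "
    "(positive, negative, or neutral).\n"
    f"{demos}\nHeadline: {HEADLINE}\nTarget: {TARGET}\nSentiment:"
)
```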
Abstract
The study examines the impact of prompt design on Large Language Models (LLMs) for TSA of news headlines. It compares zero-shot and few-shot prompting at varying levels of prescriptiveness, evaluates predictive accuracy, and quantifies uncertainty in LLM predictions.
Fine-tuned encoder models such as BERT deliver strong TSA performance but require labeled training data. LLMs, in contrast, offer a versatile approach that needs no fine-tuning; however, their performance and consistency depend on prompt design.
The study uses Croatian, English, and Polish datasets to compare LLMs against fine-tuned BERT models. Results show that more prescriptive prompts improve predictive accuracy, though the size of the gain varies by model. LLM uncertainty quantification methods tend to be well-calibrated but do not reflect the human label subjectivity captured by inter-annotator agreement.
Overall, the research provides insights into the potential of LLMs for TSA of news headlines and highlights the importance of prompt design in maximizing their performance.
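As an illustration of the uncertainty-quantification angle, the sketch below estimates a normalized entropy over repeated LLM label samples. This sampling-based approach is assumed here for illustration; the study's exact methods may differ, and the hard-coded sample list stands in for a hypothetical temperature-sampled LLM call:

```python
import math
from collections import Counter

LABELS = ("positive", "negative", "neutral")

def label_entropy(samples: list[str]) -> float:
    """Normalized entropy of sampled labels:
    0.0 = fully confident, 1.0 = maximally uncertain."""
    counts = Counter(samples)
    total = len(samples)
    h = -sum((c / total) * math.log2(c / total) for c in counts.values())
    return h / math.log2(len(LABELS))

# Hypothetical usage: each sample would come from one LLM call at
# temperature > 0 with the TSA prompt, parsed into a single label.
samples = ["negative", "negative", "neutral", "negative", "negative"]
print(label_entropy(samples))  # ~0.46, a moderately uncertain prediction
```

Calibration can then be assessed by binning predictions on such a score and comparing confidence against observed accuracy; per the study's finding, a score can be well-calibrated in this sense while still failing to track human inter-annotator agreement.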
Stats
SEN (F1 scores): GPT-3.5 Turbo 61.3; GPT-4 Turbo 65.9; Neural Chat 59.8
STONE (F1 scores): Mistral 56.1; Neural Chat 66.3; BERT* 63.6
Quotes
"Detecting sentiment through author's intent and news presentation is crucial for targeted sentiment analysis."
"LLM uncertainty tends to be well-calibrated but does not align with human subjectivity."