Core Concepts
In zero-shot settings, LLMs struggle to match smaller fine-tuned models, and prompting strategies significantly impact accuracy.
Abstract
This article explores the impact of prompt complexity on zero-shot classification using Large Language Models (LLMs) in Computational Social Science. The study evaluates the performance of two LLMs, GPT and LLaMA-OA, on six classification tasks. Different prompting strategies are tested to understand their effects on classification accuracy. Results show that while LLMs can outperform simple baselines like Logistic Regression, they still fall short compared to fine-tuned models like BERT-large. The study highlights the importance of selecting effective prompt strategies and the potential benefits of using synonyms in prompts to improve model performance.
Directory:
Abstract:
Instruction-tuned LLMs exhibit impressive language understanding.
Zero-shot performance evaluated on six CSS tasks with different prompting strategies.
Introduction:
Transfer learning facilitated by instruction fine-tuning for LLMs.
Importance of understanding capabilities and limitations for CSS tasks.
Methodology:
Four different prompting strategies tested: Basic Instruction, Task and Label Description, Few-sample Prompting, Memory Recall.
Synonyms used to replace original labels in prompts for improved performance.
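The four strategies and the synonym-replacement idea can be illustrated with a minimal sketch. The template wording, label names, and synonym mapping below are hypothetical examples, not the paper's actual prompts:

```python
# Hypothetical templates for the four prompting strategies named above.
# Exact wording is an assumption; the study's templates may differ.
TEMPLATES = {
    "basic_instruction": "Classify the following text as {labels}.\nText: {text}\nLabel:",
    "task_label_description": (
        "Task: {task_description}\nLabels: {label_descriptions}\n"
        "Text: {text}\nLabel:"
    ),
    "few_sample": "{demonstrations}\nText: {text}\nLabel:",
    "memory_recall": (
        "Recall what you know about {task_description}, then classify.\n"
        "Text: {text}\nLabel:"
    ),
}

# Assumed synonym mapping for an offensive-language task (illustrative only).
SYNONYMS = {"offensive": "abusive", "non-offensive": "acceptable"}

def apply_synonyms(prompt: str, synonyms: dict[str, str]) -> str:
    """Replace original label names in a prompt with their synonyms.

    Longer labels are replaced first so that e.g. 'non-offensive' is not
    partially rewritten by the substitution for 'offensive'.
    """
    for original in sorted(synonyms, key=len, reverse=True):
        prompt = prompt.replace(original, synonyms[original])
    return prompt

prompt = TEMPLATES["basic_instruction"].format(
    labels="offensive or non-offensive",
    text="You people are the worst.",
)
print(apply_synonyms(prompt, SYNONYMS))
```

Ordering the replacements by label length avoids one label name being a substring of another, which would otherwise corrupt the prompt.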
Data:
Six datasets selected covering various CSS tasks with manual annotations.
Experimental Setup:
Comparison of zero-shot classification results between LLMs and baselines (Logistic Regression, BERT-large).
Evaluation metrics include Accuracy and F1 scores.
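The Accuracy and F1 metrics mentioned above can be computed as in this minimal stdlib sketch. Macro-averaging for F1 is an assumption here (the paper may use micro or weighted averaging), and in practice `sklearn.metrics` would typically be used instead:

```python
def accuracy(y_true: list[str], y_pred: list[str]) -> float:
    """Fraction of predictions that exactly match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def macro_f1(y_true: list[str], y_pred: list[str]) -> float:
    """Unweighted mean of per-class F1 scores (macro-averaged F1)."""
    labels = set(y_true) | set(y_pred)
    f1_scores = []
    for label in labels:
        tp = sum(t == p == label for t, p in zip(y_true, y_pred))
        fp = sum(p == label and t != label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        denom = precision + recall
        f1_scores.append(2 * precision * recall / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)

gold = ["pos", "pos", "neg", "neg"]
pred = ["pos", "neg", "neg", "neg"]
print(accuracy(gold, pred))  # 0.75
```

Macro-F1 weights every class equally, which matters for the imbalanced label distributions common in CSS datasets, where plain accuracy can be inflated by the majority class.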
Results:
GPT performs better than LLaMA-OA across most prompt settings.
Adding complexity to prompts does not always enhance model performance.
Error Analysis:
Shared errors observed across synonym settings, indicating model limitations rather than prompt-specific failures.
Conclusion:
Recommendations for developing effective prompts for zero-shot classification tasks using LLMs.
Future work includes exploring advanced prompt methods and addressing data leakage concerns.
Stats
Because the models' training data is opaque, it is uncertain whether the evaluation datasets were seen during training (a data leakage risk).
High accuracy achieved by LLMs suggests potential use as data annotation tools in CSS tasks.
Quotes
"LLMs can be employed as strong baseline models for zero-shot classification tasks."
"Replacing original labels with synonyms allows models to better understand task requirements."