Core Concepts
Aggregating the predictions of multiple instruction-tuned language models can outperform individual models, but still falls short of supervised learning approaches on subjective text classification tasks.
Abstract
The paper explores the feasibility of using four open-source instruction-tuned language models (Flan-T5, Flan-UL2, T0, and Tk-Instruct) as "annotators" for five subjective text classification tasks across four languages (English, French, German, and Spanish).
The key findings are:
The language models exhibit specialization, with different models performing better on different tasks and languages. This suggests that aggregating their predictions could be beneficial.
Aggregating the model predictions using majority voting or a Bayesian annotation model (MACE, Multi-Annotator Competence Estimation) indeed outperforms the individual models on average. In particular, the per-model competence estimates that MACE learns correlate well with each model's actual performance.
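Majority voting over the models' per-example labels can be sketched as follows. This is a minimal illustration, not the paper's implementation; the tie-breaking heuristic (prefer the globally more frequent label) is an assumption, since the paper's summary does not specify one.

```python
from collections import Counter

def majority_vote(predictions):
    """Aggregate per-example labels from several models.

    predictions: list of per-model label lists, all the same length.
    Ties are broken by the label that is most common overall
    (an illustrative heuristic, not necessarily the paper's).
    """
    n_examples = len(predictions[0])
    overall = Counter(label for model in predictions for label in model)
    aggregated = []
    for i in range(n_examples):
        votes = Counter(model[i] for model in predictions)
        best_count = votes.most_common(1)[0][1]
        tied = [lab for lab, c in votes.items() if c == best_count]
        # break ties in favor of the globally more frequent label
        aggregated.append(max(tied, key=lambda lab: overall[lab]))
    return aggregated

# Three "annotator" models labeling four examples
preds = [
    ["pos", "neg", "pos", "neg"],
    ["pos", "pos", "neg", "neg"],
    ["neg", "pos", "pos", "neg"],
]
print(majority_vote(preds))  # → ['pos', 'pos', 'pos', 'neg']
```

MACE goes further than this: it jointly infers each annotator's competence and the latent true labels, which is why its aggregation can down-weight models that specialize poorly on a given task.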
However, even the best aggregated performance is still well below that of simple supervised learning models, let alone more advanced Transformer-based supervised models. The performance gap is over 10 F1 points on average.
Surprisingly, few-shot learning with carefully selected seed examples does not consistently improve over zero-shot learning, likely due to the difficulty in selecting high-quality exemplars.
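The zero-shot vs. few-shot distinction comes down to whether labeled exemplars are prepended to the classification prompt. A minimal sketch, assuming a generic template (the template and label names here are illustrative, not the paper's actual prompts):

```python
def build_prompt(text, labels, exemplars=None):
    """Build a classification prompt; zero-shot when exemplars is None.

    All wording below is a hypothetical template for illustration.
    """
    lines = [f"Classify the text as one of: {', '.join(labels)}."]
    # Few-shot: prepend labeled seed examples before the query
    for ex_text, ex_label in (exemplars or []):
        lines.append(f"Text: {ex_text}\nLabel: {ex_label}")
    lines.append(f"Text: {text}\nLabel:")
    return "\n\n".join(lines)

labels = ["positive", "negative"]
zero_shot = build_prompt("Great service!", labels)
few_shot = build_prompt(
    "Great service!", labels,
    exemplars=[("I loved it.", "positive"),
               ("Terrible food.", "negative")],
)
```

The paper's finding is that the choice of exemplars introduces enough variance that few-shot prompts do not reliably beat the zero-shot version.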
The authors discuss the trade-offs between using language models versus human annotators, considering aspects like performance, cost, bias, and ethical implications. They conclude that while language model annotation can be a quick and cost-effective solution, human annotation is still vital for achieving high performance, especially on subjective tasks.
Stats
"Aggregated labels are 4.2 F1-points better than the average LLM."
"Even the best-aggregated performance is still well below that of even simple supervised models trained on the same data, and substantially lower than Transformer-based supervised models (by over 10 F1 points on average)."
Quotes
"Different models indeed excel on some tasks or languages, but not on others. Some models even specialize on certain labels in a given task, but perform poorly on the others."
"Aggregating several ZSL-prompted LLMs is better than using a single LLM. Surprisingly, FSL-prompting is too varied to consistently improve performance."
"However, treating LLMs as annotators cannot rival using human annotators for fine-tuning or supervised learning."