Core Concepts
The paper introduces a novel method called "incremental utility" that estimates how much additional knowledge a demonstration brings to a large language model in few-shot in-context learning, and shows its effectiveness compared to previous utility-estimation approaches.
Abstract
This paper presents an analysis of different utility functions for selecting demonstrations in few-shot in-context learning (ICL) with large language models (LLMs). The authors introduce a novel method called "incremental utility" that estimates how much additional knowledge a demonstration brings to the LLM by contrasting the model's 0-shot and 1-shot performance.
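The 0-shot/1-shot contrast can be sketched in a few lines. This is a minimal illustration of the idea as summarized above, not the paper's implementation; the `base_score` interface and all names are assumptions. The base score could be either the output probability or a downstream metric.

```python
def incremental_utility(base_score, demo, x, y):
    """Incremental utility of a demonstration `demo` for example (x, y):
    the gain in the base score when the demo is prepended to the prompt,
    i.e. (1-shot score) - (0-shot score)."""
    zero_shot = base_score("", x, y)    # prompt contains no demonstration
    one_shot = base_score(demo, x, y)   # same prompt with demo prepended
    return one_shot - zero_shot
```

A positive value means the demonstration adds knowledge the model lacked; a negative value means it actively hurts, which is exactly the signal the contrastive reranker training exploits.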
The key highlights are:
The authors compare two types of utility functions: (1) the LLM's output probability of generating the ground-truth output, and (2) a task-specific reward function given the LLM's prediction.
The output probability is effective when its values are well distributed across the whole range, as is typical of classification tasks. The downstream-metric reward is more robust for tasks with longer outputs, such as segmentation and translation.
The proposed incremental utility further improves ICL by enabling the reranking model to be trained on contrastive examples, which expose both the positive and negative impacts of demonstrations.
Constrained retrieval, which ensures equal coverage of class labels in the retrieved candidates, is helpful when the retrieved set is imbalanced.
The authors provide general instructions on when to use the different utility functions based on the task characteristics.
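The two utility-function families compared above can be sketched as follows. The `lm_logprob`/`lm_generate` interfaces, function names, and the exact-match metric are illustrative assumptions, not the paper's API.

```python
import math

def output_probability_utility(lm_logprob, prompt, gold):
    """(1) The LLM's probability of generating the ground-truth output:
    exponentiate the sequence log-probability the model assigns to `gold`."""
    return math.exp(lm_logprob(prompt, gold))

def downstream_metric_utility(lm_generate, metric, prompt, gold):
    """(2) A task-specific reward (e.g. accuracy, F1, BLEU) computed on
    the LLM's actual prediction rather than on the gold string's probability."""
    prediction = lm_generate(prompt)
    return metric(prediction, gold)
```

The practical distinction mirrors the findings above: (1) needs well-spread probability values to be informative, while (2) stays meaningful even when the probability of any single long gold sequence is vanishingly small.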
The analysis is comprehensive, covering binary/multi-class classification, segmentation, and translation tasks across multiple languages. The authors also discuss the generalization of their findings by experimenting with different LLMs and retrievers.
Stats
The authors report the following key statistics and figures:
The output probability (OP) values are well distributed across the whole range [0.0, 1.0] on the classification datasets, while the downstream metric (DM) values show a more balanced distribution on the non-classification datasets.
60-80% of the uOP values fall into the [0.0, 0.05) bucket on the SSENT and XML-MT datasets, indicating the LLM's difficulty in generating long text outputs.
The number of contrastive training examples, where a demonstration has both positive and negative impacts, varies across datasets and correlates with the effectiveness of the incremental utility.
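A demonstration counts as contrastive when its incremental utility is positive for some queries and negative for others. The selection rule below is an assumption inferred from this summary, not the paper's exact recipe.

```python
def find_contrastive_demos(utilities):
    """utilities: dict mapping each demonstration to a list of its
    incremental-utility values, one per training query. A demo is
    contrastive if both signs occur, i.e. it helps some queries and
    hurts others, giving the reranker positive AND negative evidence."""
    return [
        demo for demo, values in utilities.items()
        if any(u > 0 for u in values) and any(u < 0 for u in values)
    ]
```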