The paper discusses challenges in evaluating the in-context learning (ICL) classification performance of language models: small dataset sizes, extensive prompt selection against the validation set, and intentionally difficult tasks that yield near-random performance.
The authors introduce a stronger random baseline that accounts for validation set reuse and small datasets. This baseline is the expected maximum accuracy across multiple random classifiers, rather than the expected accuracy of a single random classifier.
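The expected maximum accuracy of several random classifiers can be computed directly from the binomial distribution. The sketch below is illustrative only, not the authors' released code: it assumes each of t independently drawn random classifiers labels each of n validation examples correctly with probability 1/num_classes, and the function name expected_max_random_accuracy is made up for this example.

```python
import numpy as np
from scipy.stats import binom

def expected_max_random_accuracy(n, num_classes, t):
    """Expected accuracy of the best of t independent random classifiers
    on a validation set of n examples with num_classes labels each.

    A single random classifier's number of correct answers is
    Binomial(n, 1/num_classes); the maximum of t i.i.d. such counts has
    CDF F(k)**t, so E[max] = sum_{k=0}^{n-1} (1 - F(k)**t).
    """
    p = 1.0 / num_classes
    k = np.arange(n)                 # k = 0, 1, ..., n-1
    cdf = binom.cdf(k, n, p)         # F(k) for a single random classifier
    expected_max_correct = np.sum(1.0 - cdf**t)
    return expected_max_correct / n

# Standard baseline (t = 1) vs. a stronger baseline after repeated validation reuse.
print(expected_max_random_accuracy(n=100, num_classes=2, t=1))    # ~0.50
print(expected_max_random_accuracy(n=100, num_classes=2, t=100))  # well above 0.50
```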
The key insights are:
The standard random baseline is stable when the evaluation set is used only once or when the dataset is large, but it does not account for validation set reuse, which is a common practice.
The stronger random baseline, which is the expected maximum accuracy across multiple random classifiers, provides a more appropriate comparison, especially for small datasets and when the validation set is reused multiple times.
When choosing the best prompt demonstrations across six quantized language models applied to 16 BIG-bench Lite tasks, more than 20% of the few-shot results that exceed the standard baseline do not exceed the stronger random baseline.
The stronger random baseline is also a better predictor of held-out test set performance than the standard baseline, helping to avoid unnecessary test set evaluations.
The authors provide a simple method to calculate the maximum random baseline and release code for a drop-in replacement baseline. They argue that reporting the number of validation set evaluations, in addition to the validation set size, should be a best practice for ICL evaluation.
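To see why the number of validation-set evaluations matters, the illustrative helper sketched above can be reused to show how the maximum random baseline rises as more random classifiers (or prompts) are tried on a small binary dataset; these values follow from the binomial formula above, not from figures in the paper.

```python
# Maximum random baseline for a 50-example binary validation set as the
# number of validation-set evaluations t grows (uses the sketch above).
for t in (1, 10, 100, 1000):
    baseline = expected_max_random_accuracy(n=50, num_classes=2, t=t)
    print(f"t={t:4d}  maximum random baseline = {baseline:.3f}")
```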
by Gregory Yaun... at arxiv.org, 04-22-2024
https://arxiv.org/pdf/2404.13020.pdf