Core Concepts
The data distribution has a measurable and statistically significant impact on both absolute (F1/Accuracy) and relative (Ranking) performance of NLP models. The variance in out-of-distribution model performance can be predicted from the differences in data distribution between source and target datasets.
Abstract
This paper presents exploratory research on quantifying the impact of data distribution on the performance and evaluation of NLP models. The authors propose an automated framework called "Benchmark Transparency" that measures the distribution of data points along six dimensions: ambiguity, difficulty, discriminability, length, noise, and perplexity.
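To make the measurement step concrete, here is a minimal sketch of profiling a dataset along named data dimensions. It assumes each dimension is backed by a per-example scoring function; the only scorer shown (token count as a proxy for length) and the function names are hypothetical stand-ins, not the paper's definitions.

```python
# Sketch only: the paper's exact feature definitions are not reproduced here;
# the scorers passed in are hypothetical stand-ins.
from typing import Callable, Dict, List
import statistics

FeatureFn = Callable[[str], float]  # maps one example's text to a score

def profile_dataset(texts: List[str],
                    feature_fns: Dict[str, FeatureFn]) -> Dict[str, Dict[str, float]]:
    """Summarize how a dataset is distributed along each data dimension."""
    profile = {}
    for name, score in feature_fns.items():
        values = [score(t) for t in texts]
        profile[name] = {"mean": statistics.fmean(values),
                         "stdev": statistics.pstdev(values)}
    return profile

# Example: 'length' uses a simple token count; the other five dimensions
# (ambiguity, difficulty, discriminability, noise, perplexity) would each
# need a backing model (e.g. a language model for perplexity).
feature_fns = {"length": lambda t: float(len(t.split()))}
print(profile_dataset(["a short example", "a somewhat longer example text"], feature_fns))
```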
The key findings are:
The data distribution has a measurable and statistically significant impact on both absolute (F1/Accuracy) and relative (Ranking) performance of models. Changes in data distribution can shift F1 by 6-12 points, a substantial difference that calls into question the validity of standard evaluation approaches.
The impact of data distribution is often larger than the impact of changing the evaluation metric: four of the six data features are more impactful than switching the metric from F1 to "exact" matching.
The authors demonstrate that the impact of data on evaluation is not just observable but also predictable. They propose "benchmark transparency" as a method for comparing datasets and quantifying the similarity between them; the resulting "dataset similarity vector" can be used to predict how well a model generalizes out of distribution (a sketch follows these findings).
The six data dimensions (ambiguity, difficulty, discriminability, length, noise, and perplexity) are empirically independent and capture orthogonal aspects of the data, though they differ in how strongly they affect model performance.
The authors argue that a reliable evaluation framework needs to identify and quantify the factors in the environment that largely affect the reported model performance. Incorporating data-centric features can increase the reliability of evaluation, improve the use of NLP benchmarks, and provide a more accurate approximation for out-of-distribution model performance.
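The following sketch shows one plausible way to operationalize the similarity-vector idea from the findings above: describe each dataset by its per-dimension feature means, take the target-minus-source difference as the "dataset similarity vector", and fit a regressor that maps this vector to out-of-distribution F1. The dataset pairs, feature values, and choice of a linear model are synthetic assumptions for illustration, not the paper's actual setup.

```python
# Illustrative sketch, not the paper's exact formulation.
import numpy as np
from sklearn.linear_model import LinearRegression

DIMENSIONS = ["ambiguity", "difficulty", "discriminability",
              "length", "noise", "perplexity"]

def similarity_vector(source_means, target_means):
    """Per-dimension difference between target and source feature means."""
    return np.array([target_means[d] - source_means[d] for d in DIMENSIONS])

# Synthetic stand-in data: in practice each row would come from a real
# (source, target) dataset pair and the model's measured OOD F1 on target.
rng = np.random.default_rng(0)
pairs = [({d: rng.normal() for d in DIMENSIONS},
          {d: rng.normal() for d in DIMENSIONS}) for _ in range(50)]
X = np.stack([similarity_vector(s, t) for s, t in pairs])
y = 0.8 - 0.05 * np.abs(X).sum(axis=1) + rng.normal(0, 0.01, len(X))  # fake OOD F1

predictor = LinearRegression().fit(X, y)
new_pair = ({d: 0.0 for d in DIMENSIONS}, {d: 0.3 for d in DIMENSIONS})
print(predictor.predict(similarity_vector(*new_pair).reshape(1, -1)))
```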
Stats
The data distribution has a measurable and statistically significant impact on both absolute (F1/Accuracy) and relative (Ranking) performance of NLP models.
A change in F1 by 6-12 points is substantial and statistically significant (one generic way to test such a gap is sketched after these stats).
4 out of the 6 data features are more impactful than changing the metric from F1 to "exact" matching.
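As background for the "statistically significant" claims above, the sketch below shows one generic way such an F1 gap can be tested: bootstrap the difference in F1 between two test subsets drawn from different data distributions and check whether the 95% confidence interval excludes zero. The labels and predictions are fabricated, and this is not necessarily the test used in the paper.

```python
# Generic illustration of a bootstrap test for an F1 gap between two subsets.
import numpy as np
from sklearn.metrics import f1_score

def bootstrap_f1_gap(y_a, p_a, y_b, p_b, n_resamples=10_000, seed=0):
    """95% CI for F1(subset A) - F1(subset B) under independent resampling."""
    rng = np.random.default_rng(seed)
    gaps = []
    for _ in range(n_resamples):
        ia = rng.integers(0, len(y_a), len(y_a))   # resample subset A with replacement
        ib = rng.integers(0, len(y_b), len(y_b))   # resample subset B with replacement
        gaps.append(f1_score(y_a[ia], p_a[ia], average="macro")
                    - f1_score(y_b[ib], p_b[ib], average="macro"))
    return np.percentile(gaps, [2.5, 97.5])

# Toy usage with fabricated labels/predictions for two test subsets:
rng = np.random.default_rng(1)
y_a = rng.integers(0, 2, 400); p_a = np.where(rng.random(400) < 0.85, y_a, 1 - y_a)
y_b = rng.integers(0, 2, 400); p_b = np.where(rng.random(400) < 0.75, y_b, 1 - y_b)
print(bootstrap_f1_gap(y_a, p_a, y_b, p_b, n_resamples=2000))
```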
Quotes
"A change in F1 by 6 – 12 points is substantial and statistically significant and puts in question the validity of standard evaluation approaches."
"We argue that a reliable evaluation framework needs to identify the factors in the environment that largely affect the reported model performance."