Quantifying the Impact of Data Distribution on the Evaluation of NLP Models
Data distribution has a measurable and statistically significant impact on both the absolute (F1/accuracy) and relative (ranking) performance of NLP models. The variance in models' out-of-distribution performance can be predicted from the differences in data distribution between the source and target datasets.
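
As an illustration of the second claim, the sketch below shows one way such a prediction could be set up. It is a minimal example under stated assumptions, not the method used in this work: the distribution difference is assumed to be the Jensen-Shannon divergence between whitespace-tokenized unigram distributions, and the predictor is assumed to be a one-variable least-squares line. The function names (`unigram_dist`, `js_divergence`, `fit_line`) are hypothetical and introduced only for this example.

```python
import math
from collections import Counter


def unigram_dist(texts):
    """Relative unigram frequencies over lowercased, whitespace-split tokens.

    Assumption: unigram frequencies stand in for "data distribution";
    the source does not say which features are compared.
    """
    counts = Counter(tok for text in texts for tok in text.lower().split())
    total = sum(counts.values())
    return {tok: c / total for tok, c in counts.items()}


def js_divergence(p, q):
    """Jensen-Shannon divergence (log base 2) between two distributions.

    Returns a value in [0, 1]: 0 for identical distributions,
    1 for distributions with disjoint support.
    """
    vocab = set(p) | set(q)
    # Mixture distribution m = (p + q) / 2 over the joint vocabulary.
    m = {t: 0.5 * (p.get(t, 0.0) + q.get(t, 0.0)) for t in vocab}

    def kl(a):
        # KL(a || m); terms with a[t] == 0 contribute nothing.
        return sum(a[t] * math.log2(a[t] / m[t]) for t in a if a[t] > 0)

    return 0.5 * kl(p) + 0.5 * kl(q)


def fit_line(xs, ys):
    """Ordinary least squares for y = a + b * x.

    Returns (a, b, r2). Assumes at least two distinct x values
    and non-constant y values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = mean_y - b * mean_x
    ss_res = sum((y - (a + b * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return a, b, 1.0 - ss_res / ss_tot
```

In this setup, each (source, target) dataset pair would contribute one point: `js_divergence(unigram_dist(source_texts), unigram_dist(target_texts))` as the predictor and the observed change in F1 or accuracy as the response. The `r2` value returned by `fit_line` then gives a rough indication of how much of the variance in out-of-distribution performance the divergence accounts for.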