Annotated NLP benchmarks contain both annotation errors and genuine human label variation, and distinguishing the two requires a systematic methodology. Label variation reflects plausible disagreement among annotators (e.g., on ambiguous or subjective items), whereas annotation errors are labels that are simply wrong; conflating the two, or correcting the former as if it were the latter, undermines benchmark quality.