Annotators often make errors when labeling data, yet not all disagreement is error: human label variation is a distinct, legitimate phenomenon. Distinguishing annotation errors from plausible label variation is crucial for improving NLP benchmarks and calls for a systematic methodology.