Core Concepts
Human label variation and annotation errors are both pervasive in NLP benchmarks, and a systematic methodology is needed to tell them apart.
Summary
The paper addresses the prevalence of human label variation and annotation errors in NLP benchmarks and the need to distinguish between the two. It introduces the VARIERR dataset and a methodology for teasing apart error from signal, focusing on the English NLI task. The study evaluates state-of-the-art automatic error detection (AED) methods alongside GPTs, showing that traditional AED methods underperform both humans and GPTs. The results suggest that errors are often concealed under human label variation, underscoring the importance of improving data quality and trust in NLP systems.
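As an illustration of how explanation-level validity judgments of this kind can be collected automatically, the sketch below prompts an LLM to decide whether an annotator's explanation gives a valid reason for an NLI label. It assumes the OpenAI Python client; the model name, prompt wording, and helper function are illustrative assumptions, not the exact setup used in the VARIERR study.

```python
# Minimal sketch: ask an LLM whether an explanation validly supports an NLI label.
# Assumes the OpenAI Python client; prompt and model name are illustrative only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

def judge_explanation(premise: str, hypothesis: str, label: str, explanation: str) -> str:
    """Return the model's verdict ('valid' or 'invalid') for a label explanation."""
    prompt = (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Proposed label: {label}\n"
        f"Explanation: {explanation}\n\n"
        "Does the explanation give a valid reason for this label? Answer 'valid' or 'invalid'."
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return response.choices[0].message.content.strip().lower()

# Example: an explanation that argues for entailment but is attached to a contradiction label.
print(judge_explanation(
    premise="A man is playing a guitar on stage.",
    hypothesis="A musician is performing.",
    label="contradiction",
    explanation="Playing a guitar on stage means the man is performing music.",
))
```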
Statistics
VARIERR contains 7,574 validity judgments on 1,933 explanations for 500 re-annotated NLI items.
State-of-the-art AED methods significantly underperform GPTs and humans (a simple AED-style baseline is sketched after this section).
GPT-4 is the best system but falls short of human performance.
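For context on the AED comparison, a common baseline flags items whose annotated label receives low probability from a trained model. The sketch below assumes a pretrained NLI model from Hugging Face (roberta-large-mnli) and an arbitrary confidence threshold; it is not the specific AED method evaluated in the paper.

```python
# Minimal sketch of a model-confidence AED baseline: flag items whose annotated
# label gets low probability from a pretrained NLI model. Model name and the
# 0.2 threshold are illustrative assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

MODEL_NAME = "roberta-large-mnli"
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME).eval()

def label_probability(premise: str, hypothesis: str, label: str) -> float:
    """Probability the NLI model assigns to the annotated label."""
    inputs = tokenizer(premise, hypothesis, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = model(**inputs).logits.softmax(dim=-1)[0]
    label_id = model.config.label2id[label.upper()]  # CONTRADICTION / NEUTRAL / ENTAILMENT
    return probs[label_id].item()

# Flag annotations whose label falls below the confidence threshold.
items = [
    ("A man is playing a guitar on stage.", "A musician is performing.", "entailment"),
    ("A man is playing a guitar on stage.", "A musician is performing.", "contradiction"),
]
for premise, hypothesis, label in items:
    p = label_probability(premise, hypothesis, label)
    print(f"{label:13s} p={p:.3f} {'<- candidate error' if p < 0.2 else ''}")
```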
Quotes
"Human label variation arises when annotators assign different labels to the same item for valid reasons, while annotation errors occur when labels are assigned for invalid reasons."
"Our methodology is applicable beyond NLI, offering fertile ground for future research on error versus plausible variation."