The content discusses the prevalence of human label variation and annotation errors in NLP benchmarks, arguing that the two must be distinguished: variation carries genuine signal, while errors degrade data quality. It introduces the VARIERR dataset and a methodology for separating error from signal, focusing on the NLI task in English. The study evaluates automatic error detection (AED) methods alongside GPT models, finding that traditional AED methods underperform both humans and GPTs at flagging errors. The results suggest that errors are often concealed under human label variation, underscoring the need to account for both in order to improve data quality and trust in NLP systems.
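To make the error-detection setup concrete, here is a minimal sketch of how an LLM might be asked to judge whether a label-reason pair on an NLI item looks like plausible variation or a likely annotation error. The prompt wording and the `judge_label` helper are illustrative assumptions, not the paper's exact protocol; only the standard OpenAI chat-completions API is used.

```python
# Illustrative sketch (assumed prompt, not the paper's exact setup):
# ask an LLM whether a stated reason plausibly supports an NLI label.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def judge_label(premise: str, hypothesis: str, label: str, reason: str) -> str:
    """Return the model's verdict: plausible variation vs. likely error."""
    prompt = (
        "Given the NLI premise and hypothesis below, does the stated reason "
        "make sense for the given label? Answer 'yes' (plausible human label "
        "variation) or 'no' (likely annotation error), then briefly explain.\n\n"
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Label: {label}\n"
        f"Reason: {reason}"
    )
    response = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,  # deterministic output for reproducible judgments
    )
    return response.choices[0].message.content.strip()


# Hypothetical example: a contradiction label whose reason does not
# support it is a candidate annotation error rather than variation.
print(judge_label(
    premise="A man is playing a guitar on stage.",
    hypothesis="A musician is performing.",
    label="contradiction",
    reason="The man could be a musician.",
))
```

Items whose labels no model-validated reason supports would then be treated as error candidates, while items with multiple well-supported labels count as legitimate variation.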
Source: arxiv.org