The content discusses the prevalence of human label variation and annotation errors in NLP benchmarks, emphasizing the need to distinguish between the two. It introduces the VARIERR dataset and a methodology for teasing apart error from signal, focusing on the NLI task in English. The study evaluates automatic error detection (AED) methods alongside GPT models, and finds that traditional AED methods underperform both humans and GPTs. The results suggest that errors are often concealed under human label variation, underscoring the importance of improving data quality and trust in NLP systems.
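To illustrate the distinction between annotation error and legitimate human label variation, here is a minimal sketch of a simple heuristic. It is not the VARIERR methodology: it flags a label as a candidate error only when it has both weak annotator support and low model probability, so well-supported minority labels are kept as plausible variation. All function names, thresholds, and scores below are illustrative assumptions.

```python
# Hypothetical heuristic (not the paper's method): separate candidate
# annotation errors from plausible human label variation on one NLI item.
from collections import Counter
from typing import Dict, List

LABELS = ["entailment", "neutral", "contradiction"]


def label_distribution(annotations: List[str]) -> Dict[str, float]:
    """Relative frequency of each NLI label among the human annotators."""
    counts = Counter(annotations)
    total = sum(counts.values())
    return {label: counts.get(label, 0) / total for label in LABELS}


def flag_candidate_errors(
    annotations: List[str],
    model_probs: Dict[str, float],   # hypothetical model label probabilities
    support_threshold: float = 0.25, # illustrative cut-offs, not from the paper
    model_threshold: float = 0.10,
) -> List[str]:
    """
    Treat a label as a candidate *error* (rather than label variation) only
    when it has both weak annotator support and low model probability;
    labels backed by either source are kept as potential signal.
    """
    human_dist = label_distribution(annotations)
    return [
        label
        for label in set(annotations)
        if human_dist[label] < support_threshold
        and model_probs.get(label, 0.0) < model_threshold
    ]


if __name__ == "__main__":
    # One NLI item labelled by five annotators; the model scores are made up.
    annotations = ["neutral", "neutral", "neutral", "entailment", "contradiction"]
    model_probs = {"entailment": 0.30, "neutral": 0.65, "contradiction": 0.05}
    print(flag_candidate_errors(annotations, model_probs))
    # -> ['contradiction']: low human support and low model probability,
    # while 'entailment' survives as plausible human label variation.
```

The design point mirrors the summary above: low annotator agreement alone is not evidence of an error, so a second, independent plausibility signal is needed before discarding a label.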
Key insights distilled from the paper by Leon Weber-G... at arxiv.org, 03-05-2024: https://arxiv.org/pdf/2403.01931.pdf