The author examines the distinction between annotation errors and legitimate human label variation in NLI benchmarks, introducing a new dataset and methodology for separating the two.