Madaan, L., Esiobu, D., Stenetorp, P., Plank, B., & Hupkes, D. (2024). Lost in Inference: Rediscovering the Role of Natural Language Inference for Large Language Models. arXiv preprint arXiv:2411.14103.
This paper investigates whether natural language inference (NLI) benchmarks remain relevant and useful for evaluating large language models (LLMs). The authors aim to determine whether NLI tasks can effectively discriminate between LLMs of different sizes and quality, track their progress during training, and reveal how closely model predictions align with human judgment patterns.
The researchers evaluate five NLI benchmarks (SNLI, MNLI, HANS, ANLI, and αNLI) across six LLMs from the Llama and Mistral families, varying in size and architecture. They analyze model performance in zero-shot and few-shot settings, track accuracy during pre-training, and examine the impact of data contamination. They also use the ChaosNLI dataset, which provides multiple human annotations for a subset of the benchmarks, to assess how well model predictions align with human judgment distributions, particularly on ambiguous items where annotators disagree.
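The summary does not describe the exact prompting or scoring setup, so the sketch below is only an illustration of how zero-shot NLI evaluation is commonly framed: each candidate label is appended to a premise–hypothesis prompt, and the label whose completion the model scores highest is taken as the prediction. The prompt template and the `score` callable are hypothetical stand-ins, not the paper's code.

```python
from typing import Callable

# The three standard labels used by SNLI/MNLI-style benchmarks.
LABELS = ["entailment", "neutral", "contradiction"]

def build_prompt(premise: str, hypothesis: str, label: str) -> str:
    """Format one premise/hypothesis pair with a candidate label
    (illustrative template, not the paper's exact wording)."""
    return (
        f"Premise: {premise}\n"
        f"Hypothesis: {hypothesis}\n"
        f"Relationship: {label}"
    )

def predict(premise: str, hypothesis: str,
            score: Callable[[str], float]) -> str:
    """Zero-shot prediction: `score` is assumed to return the LLM's
    log-likelihood of the full prompt; the highest-scoring label wins."""
    return max(LABELS,
               key=lambda lab: score(build_prompt(premise, hypothesis, lab)))
```

A few-shot variant would simply prepend a handful of labeled examples to the same prompt before scoring.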
The study concludes that NLI benchmarks remain valuable tools for LLM evaluation and development. They offer insights into model capabilities, training dynamics, and alignment with human reasoning patterns. The authors suggest that monitoring the divergence between model and human judgment distributions, especially in ambiguous scenarios, can guide future research and development efforts.
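How such divergence can be measured is not spelled out in this summary; one standard choice with ChaosNLI-style annotation counts is the Jensen–Shannon divergence between the model's predicted label distribution and the empirical distribution of human labels. The sketch below is a minimal, self-contained illustration under that assumption, not necessarily the paper's exact metric.

```python
import numpy as np

def jensen_shannon_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """JSD between two discrete distributions over the three NLI labels.
    Inputs are normalized first; 0 means identical distributions and
    log(2) (natural log) is the maximum."""
    p = p / p.sum()
    q = q / q.sum()
    m = 0.5 * (p + q)

    def kl(a: np.ndarray, b: np.ndarray) -> float:
        mask = a > 0  # 0 * log 0 is treated as 0
        return float(np.sum(a[mask] * np.log(a[mask] / b[mask])))

    return 0.5 * kl(p, m) + 0.5 * kl(q, m)

# Hypothetical example: 100 annotators split 60/30/10 over
# entailment/neutral/contradiction, while the model assigns 0.7/0.2/0.1.
human_counts = np.array([60.0, 30.0, 10.0])
model_probs = np.array([0.7, 0.2, 0.1])
print(jensen_shannon_divergence(model_probs, human_counts))  # ≈ 0.007, i.e. close
```

Averaging this quantity over the ambiguous ChaosNLI items gives a single number that can be tracked across models or training checkpoints.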
This research contributes to the ongoing discussion on robust and informative evaluation methodologies for LLMs. By demonstrating the continued relevance of NLI benchmarks, the study encourages their wider adoption in LLM evaluation practices. The findings regarding model-human judgment alignment provide valuable insights for improving LLM robustness and reliability, particularly in real-world applications where diverse interpretations and subjective judgments are common.
The study primarily focuses on pre-trained base LLMs, leaving the evaluation of instruction-tuned models for future work. Further research could explore the impact of different prompting techniques and fine-tuning strategies on NLI performance. Additionally, investigating the reasons behind model-human judgment discrepancies and developing methods to mitigate them represent promising avenues for future research.