This paper explores the critical issue of contamination in large language models (LLMs), where evaluation datasets leak into the training data, leading to inflated performance estimates and unreliable model evaluation.
The authors first categorize contamination into two broad types: data contamination, where an accessible training corpus is found to overlap with the evaluation dataset, and model contamination, where the model itself shows evidence of having seen the evaluation data during pre-training or fine-tuning.
For data contamination, the authors review string-matching, embedding-similarity, and LLM-based methods for detecting overlap between training and evaluation datasets. These include techniques used in models like GPT-2, GPT-3, PaLM, and Llama-2.
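To make the string-matching family concrete, the sketch below flags an evaluation example when it shares a sufficiently long word n-gram with any training document. The 8-gram window, whitespace tokenization, and toy data are assumptions chosen for illustration; GPT-3, PaLM, and Llama-2 each report their own n-gram sizes, tokenizations, and thresholds.

```python
# Minimal sketch of n-gram string-matching contamination detection.
# The 8-gram window and lowercase/whitespace normalization are illustrative
# assumptions, not the exact settings of any particular model's pipeline.

def ngrams(text: str, n: int = 8):
    """Yield word n-grams of a normalized text."""
    words = text.lower().split()
    for i in range(len(words) - n + 1):
        yield " ".join(words[i:i + n])

def build_train_index(train_docs, n: int = 8):
    """Collect every n-gram appearing in the training corpus."""
    index = set()
    for doc in train_docs:
        index.update(ngrams(doc, n))
    return index

def is_contaminated(eval_example: str, train_index: set, n: int = 8) -> bool:
    """Flag an evaluation example if any of its n-grams occurs in the training data."""
    return any(g in train_index for g in ngrams(eval_example, n))

if __name__ == "__main__":
    train_docs = ["the quick brown fox jumps over the lazy dog near the river bank"]
    eval_examples = [
        "the quick brown fox jumps over the lazy dog near the river",  # overlaps
        "a completely unrelated question about protein folding",        # clean
    ]
    index = build_train_index(train_docs)
    for ex in eval_examples:
        print(is_contaminated(ex, index), "-", ex)
```

Embedding-similarity methods follow the same outline but replace the exact n-gram lookup with a cosine-similarity comparison between sentence embeddings, which also catches paraphrased overlap.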
For model contamination, the authors discuss performance analysis approaches that leverage recent datasets not seen during pre-training, as well as model completion techniques that analyze model outputs and likelihoods to detect memorization of training data. They also cover LLM-based model contamination detection methods like guided prompting and the Data Contamination Quiz.
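To make the model-completion idea concrete, here is a minimal sketch of a likelihood-based check in the spirit of methods such as Min-K% Prob: an evaluation example is scored by the average log-probability its least-likely tokens receive, on the intuition that memorized text contains few surprising tokens. The choice of gpt2, the 20% fraction, and the absence of a calibrated decision threshold are simplifications for illustration, not the exact protocol of any surveyed method.

```python
# Minimal sketch of a likelihood-based model-contamination check.
# Higher (less negative) scores suggest the model may have memorized the text.

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_NAME = "gpt2"  # any causal LM would do; gpt2 keeps the example small

def min_k_percent_score(text: str, model, tokenizer, k: float = 0.2) -> float:
    """Average log-probability of the k% least likely tokens in `text`."""
    input_ids = tokenizer(text, return_tensors="pt")["input_ids"]
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probability assigned to each actual next token.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    token_log_probs = log_probs.gather(1, input_ids[0, 1:].unsqueeze(-1)).squeeze(-1)
    n_lowest = max(1, int(k * token_log_probs.numel()))
    lowest = torch.topk(token_log_probs, n_lowest, largest=False).values
    return lowest.mean().item()

if __name__ == "__main__":
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)
    model = AutoModelForCausalLM.from_pretrained(MODEL_NAME).eval()
    score = min_k_percent_score("The quick brown fox jumps over the lazy dog.",
                                model, tokenizer)
    # In practice a threshold would be calibrated on text known to be unseen.
    print(f"min-k% log-prob score: {score:.3f}")
```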
The authors then discuss best practices for the community, such as encrypting evaluation datasets, scanning new benchmarks for contamination, and avoiding leaking data to closed-source APIs. They also highlight emerging contamination-free evaluation benchmarks like LatestEval, WIKIMIA, KIEval, and LiveCodeBench.
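As a concrete illustration of the "encrypt evaluation datasets" recommendation, the sketch below writes a benchmark to disk as ciphertext so that plain-text copies cannot be scraped into pre-training corpora. The file names and the use of symmetric Fernet encryption from the cryptography package are assumptions for illustration; published proposals also discuss public-key schemes and no-derivatives licensing.

```python
# Minimal sketch of releasing an evaluation set only in encrypted form.
# Evaluators decrypt locally just before running the benchmark.

import json
from cryptography.fernet import Fernet

def encrypt_benchmark(examples, out_path="benchmark.enc", key_path="benchmark.key"):
    """Serialize the evaluation examples and write them to disk encrypted."""
    key = Fernet.generate_key()
    ciphertext = Fernet(key).encrypt(json.dumps(examples).encode("utf-8"))
    with open(out_path, "wb") as f:
        f.write(ciphertext)
    with open(key_path, "wb") as f:   # distributed separately to evaluators
        f.write(key)

def load_benchmark(enc_path="benchmark.enc", key_path="benchmark.key"):
    """Decrypt the benchmark locally, just before evaluation."""
    with open(key_path, "rb") as f:
        key = f.read()
    with open(enc_path, "rb") as f:
        ciphertext = f.read()
    return json.loads(Fernet(key).decrypt(ciphertext).decode("utf-8"))

if __name__ == "__main__":
    encrypt_benchmark([{"question": "2+2?", "answer": "4"}])
    print(load_benchmark())
```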
Finally, the authors introduce the open-source LLMSanitize library, which implements major data and model contamination detection algorithms to help centralize and share these methods with the community. They demonstrate the library's capabilities by applying several model contamination detection techniques to popular LLMs on various datasets.
Key insights extracted from the paper by Mathieu Rava... at arxiv.org, 04-02-2024
https://arxiv.org/pdf/2404.00699.pdf