Basic Concepts
Contamination, where evaluation datasets are included in the training data, poses a critical challenge to the integrity and reliability of large language models (LLMs). This paper provides a comprehensive survey of data and model contamination detection methods, and introduces the open-source LLMSanitize library to help the community centralize and share implementations of contamination detection algorithms.
Summary
This paper explores the critical issue of contamination in large language models (LLMs), where evaluation datasets are included in the training data, leading to inflated performance and unreliable model evaluation.
The authors first categorize contamination into two broad types: data contamination, where the evaluation dataset overlaps with the training set, and model contamination, where the model has seen the evaluation data during pre-training or fine-tuning.
For data contamination, the authors review string-matching, embedding-similarity, and LLM-based methods for detecting overlap between training and evaluation datasets. These include techniques used in models like GPT-2, GPT-3, PaLM, and Llama-2.
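To make the string-matching family concrete: GPT-3's analysis, for example, flagged evaluation examples that shared a long n-gram with the training data. The sketch below is an illustrative word-level n-gram overlap check, not the survey's or any model's actual implementation; the helper names and the toy corpus are assumptions, and production systems operate on tokenized, deduplicated corpora at much larger n.

```python
from typing import Set, Tuple

def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Return the set of word-level n-grams in a text."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

def is_contaminated(eval_example: str, train_index: Set[Tuple[str, ...]], n: int) -> bool:
    """Flag an evaluation example if any of its n-grams occurs in the training index."""
    return not ngrams(eval_example, n).isdisjoint(train_index)

# Build the training-side index once, then scan each evaluation example.
# Toy corpus for illustration only; GPT-3 used 13-grams over its full training data.
train_text = "the quick brown fox jumps over the lazy dog"
train_index = ngrams(train_text, n=5)
```

An evaluation example such as "the quick brown fox jumps over the lazy dog" would be flagged against this index, while unrelated text would pass.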
For model contamination, the authors discuss performance analysis approaches that leverage recent datasets not seen during pre-training, as well as model completion techniques that analyze model outputs and likelihoods to detect memorization of training data. They also cover LLM-based model contamination detection methods like guided prompting and the Data Contamination Quiz.
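One likelihood-based signal in this family, in the spirit of the Min-K% Prob score, averages the log-probabilities of a sequence's least likely tokens: a memorized sequence tends to contain no truly surprising tokens, so its score stays near zero. The per-token log-probabilities below are hypothetical stand-ins for values that would come from the model under test; this is a sketch of the scoring step only, not the survey's implementation.

```python
from typing import List

def min_k_percent_prob(token_logprobs: List[float], k: float = 0.2) -> float:
    """Average the log-probabilities of the k% least likely tokens.

    Memorized text has few very unlikely tokens, so its score stays close
    to zero; unseen text usually contains surprising tokens that drag the
    score down. Scores above a tuned threshold suggest contamination.
    """
    n = max(1, int(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]
    return sum(lowest) / n

# Hypothetical per-token log-probs for a likely-memorized vs. an unseen sequence.
seen = [-0.05, -0.1, -0.2, -0.1, -0.05, -0.15, -0.1, -0.2, -0.1, -0.05]
unseen = [-0.1, -4.2, -0.3, -5.1, -0.2, -0.4, -3.8, -0.2, -0.3, -0.1]
```

Here `min_k_percent_prob(seen)` is much closer to zero than `min_k_percent_prob(unseen)`, which is the gap such detectors threshold on.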
The authors then discuss best practices for the community, such as encrypting evaluation datasets, scanning new benchmarks for contamination, and avoiding leaking data to closed-source APIs. They also highlight emerging contamination-free evaluation benchmarks like LatestEval, WIKIMIA, KIEval, and LiveCodeBench.
Finally, the authors introduce the open-source LLMSanitize library, which implements major data and model contamination detection algorithms to help centralize and share these methods with the community. They demonstrate the library's capabilities by applying several model contamination detection techniques to popular LLMs on various datasets.
Statistics
The GPT-2 analysis found 1-6% overlap between the test sets of common LM datasets and the WebText training set, with an average of 3.2%.
The GPT-3 analysis found substantial contamination for common datasets such as Wikipedia language-modeling benchmarks, the Children's Book Test, QuAC, and SQuAD 2.0.
Dodge et al. found varied contamination in C4, ranging from less than 2% to over 50%.
PaLM and GPT-4 found that contamination has little effect on their reported zero-shot results.
Llama-2 found contamination levels ranging from 1% to 47% across popular benchmarks.
Deng et al. found high overlap between TruthfulQA and the pre-training datasets The Pile and C4.
Quotes
"Contamination poses a multifaceted challenge, threatening not only the technical accuracy of LLMs but also their ethical and commercial viability."
"As businesses increasingly integrate AI-driven insights into their strategic planning and operational decisions, the assurance of data purity becomes intertwined with potential market success and valuation."
"Contamination becomes a critical issue: LLMs' performance may not be reliable anymore, as the high performance may be at least partly due to their previous exposure to the data."