Comprehensive Survey on Contamination in Large Language Models and the LLMSanitize Library
Contamination, in which evaluation datasets are included in the training data, poses a critical challenge to the integrity and reliability of large language model (LLM) evaluation. This paper provides a comprehensive survey of data contamination and model contamination detection methods and introduces the open-source LLMSanitize library, which helps the community centralize and share implementations of contamination detection algorithms.