Comprehensive Survey on Contamination in Large Language Models and the LLMSanitize Library
Contamination, in which evaluation datasets are included in the training data, poses a critical challenge to the integrity and reliability of large language model (LLM) evaluation. This paper provides a comprehensive survey of data contamination and model contamination detection methods and introduces the open-source LLMSanitize library, which helps the community centralize and share implementations of contamination detection algorithms.