Detecting Pretraining Data from Large Language Models: A Study on Detection Methods and Real-World Applications
Large language models are trained on data that is rarely disclosed, which raises privacy and copyright concerns. MIN-K% PROB is a method for detecting whether a given text was part of a model's pretraining data, and it shows promising results in detecting copyrighted content and dataset contamination.
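The summary above does not spell out how MIN-K% PROB scores a text, so the following is only a minimal sketch based on the commonly described formulation: take the per-token log-probabilities the model assigns to a text, keep the k% of tokens with the lowest log-probability, and average them. A higher average suggests the text is more likely to have been seen during pretraining. The function names, the default `k`, and the `threshold` value are illustrative assumptions, and the log-probabilities are assumed to have been obtained from the model beforehand.

```python
import math


def min_k_percent_score(token_logprobs, k=0.2):
    """Average log-probability of the k fraction of least likely tokens.

    token_logprobs: per-token log-probabilities for one text (assumed to be
    computed elsewhere by the model under test).
    k: fraction of tokens to keep, in (0, 1]. The default is an assumption.
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must be non-empty")
    if not 0 < k <= 1:
        raise ValueError("k must be in (0, 1]")
    # Keep at least one token so very short texts still get a score.
    n = max(1, math.ceil(len(token_logprobs) * k))
    lowest = sorted(token_logprobs)[:n]  # the n smallest log-probabilities
    return sum(lowest) / n


def is_likely_member(token_logprobs, k=0.2, threshold=-5.0):
    """Flag a text as likely pretraining data if its score exceeds a threshold.

    The threshold is hypothetical; in practice it would be tuned on texts
    with known membership status.
    """
    return min_k_percent_score(token_logprobs, k) > threshold


if __name__ == "__main__":
    example = [-1.0, -2.0, -3.0, -8.0]  # made-up log-probabilities
    print(min_k_percent_score(example, k=0.5))
    print(is_likely_member(example, k=0.5, threshold=-6.0))
```

Because the score depends only on the target model's own token probabilities, a sketch like this needs no separate reference model; the decision threshold is the main quantity that has to be calibrated.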