Noise-Robust De-Duplication of Historical News Articles at Scale
Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of applications, including de-duplicating training datasets, reducing privacy risks, and evaluating test set leakage. This study develops and evaluates neural methods for text de-duplication that significantly outperform traditional N-gram-based approaches.
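To make the baseline concrete, the sketch below illustrates the family of N-gram overlap methods the abstract contrasts with neural approaches: documents are compared by the Jaccard similarity of their character N-gram sets. The N-gram size of 5 and the 0.8 threshold are illustrative assumptions, not settings from the paper.

```python
def char_ngrams(text: str, n: int = 5) -> set[str]:
    """Return the set of character n-grams in a lowercased text."""
    text = text.lower()
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def jaccard(a: set[str], b: set[str]) -> float:
    """Jaccard similarity between two n-gram sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)

def is_near_duplicate(doc1: str, doc2: str, n: int = 5,
                      threshold: float = 0.8) -> bool:
    """Flag a pair as near-duplicates when their n-gram Jaccard
    similarity meets or exceeds the threshold (hypothetical value)."""
    return jaccard(char_ngrams(doc1, n), char_ngrams(doc2, n)) >= threshold
```

Such overlap measures are brittle under OCR noise, since a single garbled character perturbs every N-gram that spans it, which is one motivation for the noise-robust neural methods studied here.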