
Noise-Robust De-Duplication of Historical News Articles at Scale


Core Concepts
Robust identification of near-duplicate texts in large, noisy corpora is important for a variety of applications, including de-duplicating training datasets, reducing privacy risks, and evaluating test set leakage. This study develops and evaluates neural methods for text de-duplication that significantly outperform traditional N-gram based approaches.
Abstract
The key insights from this paper are:

- The authors build a large dataset called NEWS-COPY, which contains 27,210 articles with 122,876 positive duplicate pairs, to enable unbiased evaluation of text de-duplication methods. The dataset leverages the timeliness of historical news articles, where reproductions occur within a narrow time window, allowing for comprehensive hand-labeling.
- The neural methods, including a contrastively trained bi-encoder and a "re-ranking" approach that combines a bi-encoder and a cross-encoder, significantly outperform traditional N-gram overlap and locality-sensitive hashing (LSH) methods. The Adjusted Rand Index (ARI) is 93.7 for the re-rank model and 91.5 for the bi-encoder model, versus 73.7 for LSH and 75.0 for N-gram overlap.
- The neural methods are highly scalable: the bi-encoder model can de-duplicate a 10 million article corpus on a single GPU card in under 12 hours, which is comparable to the scalability of the LSH approach.
- Applying the pre-trained bi-encoder model to two subsets of the C4 dataset, RealNews and patents, identifies many noisy duplicates missed by hashing, including those resulting from news aggregators, machine translation, and OCR errors.

Overall, this work demonstrates the significant potential of neural methods for robust text de-duplication, even in the presence of various types of noise, and provides a public dataset and models to facilitate further research.
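To make the bi-encoder approach concrete, here is a minimal sketch of embedding-based de-duplication. It does not use the paper's released model: the sentence-transformers checkpoint name, the similarity threshold, and the neighbour count k are illustrative assumptions. Articles are embedded once, near neighbours are retrieved with FAISS, and duplicate clusters are read off as connected components of the similarity graph.

```python
# Sketch only: model name, threshold, and k are placeholders, not the
# paper's released bi-encoder or its settings.
import faiss
import networkx as nx
import numpy as np
from sentence_transformers import SentenceTransformer

def deduplicate(articles, model_name="all-MiniLM-L6-v2", threshold=0.9, k=10):
    model = SentenceTransformer(model_name)
    # Unit-normalised embeddings, so inner product equals cosine similarity.
    emb = model.encode(articles, normalize_embeddings=True,
                       convert_to_numpy=True).astype(np.float32)

    # Exact k-nearest-neighbour search over the embedding index.
    index = faiss.IndexFlatIP(emb.shape[1])
    index.add(emb)
    sims, nbrs = index.search(emb, k)

    # Add an edge for every pair above the threshold; duplicate clusters are
    # the connected components of the resulting graph.
    g = nx.Graph()
    g.add_nodes_from(range(len(articles)))
    for i in range(len(articles)):
        for sim, j in zip(sims[i], nbrs[i]):
            if i != int(j) and sim >= threshold:
                g.add_edge(i, int(j))
    return [sorted(c) for c in nx.connected_components(g)]

if __name__ == "__main__":
    docs = ["The senate passed the bill after a long debate.",
            "The senatc passed tbe bill after a long debate.",  # OCR-noised copy
            "Crop prices rose sharply across the midwest."]
    print(deduplicate(docs, k=2))
```

Replacing the flat FAISS index with an approximate one is what keeps this workable at the 10-million-article scale discussed above.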
Stats
"Amongst duplicated pairs of articles in the NEWS-COPY test set, the average Jaccard similarity using 3-grams (4-grams, 5-grams) between pairs of reproduced articles is 30% (26%, 23%)." "19% of duplicates have no 10-grams in common and 31% have no 15-grams in common, often as a result of minor text noise."
Quotes
"By the 1910s and 1920s, most of the articles that Americans read in their local papers had either been bought or sold on the national news market... This constructed a broadly understood American 'way of life' that would become a touchstone of U.S. domestic politics and international relations throughout the twentieth century." Julia Guarneri

Key Insights Distilled From

by Emily Silcock et al. at arxiv.org, 04-25-2024

https://arxiv.org/pdf/2210.04261.pdf
Noise-Robust De-Duplication at Scale

Deeper Inquiries

How could the NEWS-COPY dataset be expanded to include more diverse types of text duplication beyond news articles, such as academic papers, social media posts, or product descriptions?

Expanding the NEWS-COPY dataset to include a wider variety of text duplication beyond news articles would involve collecting and curating data from different sources and domains. For academic papers, one could scrape repositories like arXiv or PubMed to gather a diverse set of scholarly articles. Social media posts could be obtained from platforms like Twitter, Facebook, or Reddit, ensuring a mix of user-generated content. Product descriptions could be sourced from e-commerce websites such as Amazon or eBay.

To incorporate these diverse types of text into the dataset, the same pipeline used for news articles could be applied. For scanned or image-based sources, object detection models could be trained to extract individual text regions, followed by OCR processing to convert them into machine-readable text; born-digital sources could skip this step. Manual review and annotation by human experts would then be needed to identify duplicates and create clusters of duplicated texts. This process would need to be repeated for each new type of text to ensure a comprehensive dataset.

How could the insights from this work on noise-robust de-duplication be applied to improve the quality and reliability of large language models, which are known to memorize training data?

The insights from noise-robust de-duplication can be valuable in enhancing the quality and reliability of large language models, especially in addressing issues related to memorization of training data. Here are some ways these insights could be applied:

- Data Cleaning: Noise-robust de-duplication methods can be used to clean training datasets of duplicate or near-duplicate instances, preventing the model from memorizing redundant information and improving generalization.
- Test Set Leakage: Identifying and removing examples that are duplicated or near-duplicated between training and evaluation data helps mitigate test set leakage, where models perform well due to memorization rather than true understanding, and leads to more reliable evaluation of model performance (see the sketch after this list).
- Fine-tuning: De-duplication can be integrated into the fine-tuning pipeline; ensuring that fine-tuning data is free of duplicates helps the model learn more effectively and avoid overfitting.
- Regularization: The same principles can inform regularization techniques that penalize the model for memorizing specific patterns or instances in the training data, encouraging it to learn meaningful representations instead.

Overall, applying these insights can help create large language models that are more robust and reliable, and less prone to memorization and overfitting.
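As one concrete illustration of the test set leakage point, the sketch below flags evaluation examples whose nearest training example, under an embedding model, exceeds a similarity threshold. The model name and threshold are assumptions chosen for illustration, not values from the paper.

```python
# Sketch only: model name and threshold are illustrative assumptions.
import numpy as np
from sentence_transformers import SentenceTransformer

def flag_leaked_examples(train_texts, test_texts,
                         model_name="all-MiniLM-L6-v2", threshold=0.9):
    model = SentenceTransformer(model_name)
    train_emb = model.encode(train_texts, normalize_embeddings=True,
                             convert_to_numpy=True)
    test_emb = model.encode(test_texts, normalize_embeddings=True,
                            convert_to_numpy=True)
    # Embeddings are unit-normalised, so the dot product is cosine similarity.
    sims = test_emb @ train_emb.T
    # A test example is flagged if any training example is too similar.
    return [i for i, s in enumerate(sims.max(axis=1)) if s >= threshold]
```

The dense similarity matrix is fine for small evaluation sets; at corpus scale the same check would go through an approximate nearest-neighbour index instead.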

What other types of noise, beyond OCR errors and abridgement, could be introduced into the NEWS-COPY dataset to further test the robustness of de-duplication methods?

In addition to OCR errors and abridgement, other types of noise could be introduced into the NEWS-COPY dataset to provide a more comprehensive test of the robustness of de-duplication methods:

- Plagiarism: Instances where text is copied verbatim from other sources challenge de-duplication methods to identify and differentiate between original and duplicated content.
- Paraphrasing: Paraphrased versions of text passages test whether methods can detect similarity in meaning rather than exact word matches.
- Translation Errors: Text that has been translated multiple times, or that contains machine-translation errors, simulates language variation and inaccuracy.
- Synonym Substitution: Replacing words with synonyms or similar terms tests the sensitivity of de-duplication methods to variations in vocabulary.
- Text Obfuscation: Intentional obfuscation such as random character insertion, deletion, or substitution challenges models to identify duplicates in noisy and distorted text.

By incorporating these additional types of noise, researchers can evaluate the robustness and effectiveness of de-duplication methods across a wider range of real-world challenges; a simple way to generate such noise synthetically is sketched below.
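The sketch below injects two of the noise types from the list above, OCR-style character confusion and synonym substitution, into clean text for stress-testing a de-duplication method. The confusion table, synonym map, and rates are toy values chosen purely for demonstration.

```python
# Toy noise injection for robustness testing; tables and rates are invented.
import random

OCR_CONFUSIONS = {"e": "c", "l": "1", "o": "0", "h": "b"}
SYNONYMS = {"passed": "approved", "debate": "discussion", "bill": "measure"}

def add_ocr_noise(text, rate=0.3, seed=0):
    # Randomly swap characters for their common OCR confusions.
    rng = random.Random(seed)
    return "".join(OCR_CONFUSIONS[c] if c in OCR_CONFUSIONS and rng.random() < rate
                   else c for c in text)

def substitute_synonyms(text, rate=0.5, seed=0):
    # Randomly replace words with synonyms to vary vocabulary, not meaning.
    rng = random.Random(seed)
    return " ".join(SYNONYMS[w] if w in SYNONYMS and rng.random() < rate
                    else w for w in text.split())

clean = "the senate passed the bill after a long debate on tuesday evening"
print(add_ocr_noise(clean))
print(substitute_synonyms(clean))
```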