The article presents a semi-automated text sanitization tool designed to help whistleblowers mitigate the risk of re-identification while preserving key details about the wrongdoing they are reporting.
The key highlights are:
The tool leverages natural language processing techniques to automatically identify textual elements that pose re-identification risks, such as named entities, modifiers, and stylometric features. It assigns default risk levels to these elements.
The tool allows the whistleblower to interactively adjust the risk levels based on their contextual knowledge, enabling them to strike a balance between anonymity and retaining important details.
The tool applies various anonymization operations, including generalization, perturbation, and suppression, to the high-risk textual elements. It then uses a fine-tuned large language model to rephrase the sanitized text, preserving coherence and a neutral writing style.
The authors evaluate how far the tool reduces authorship attribution accuracy while preserving semantic similarity and sentiment. The results show that it significantly reduces authorship attribution accuracy, from 98.81% to 31.22%, while retaining up to 73.1% of the original content's semantics.
The tool is also evaluated on the Text Anonymization Benchmark, a corpus of European Court of Human Rights court cases, and on excerpts from a real-world whistleblower testimony, demonstrating its effectiveness in masking direct and quasi-identifiers.
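The identify, adjust, and anonymize steps described above can be illustrated with a minimal sketch. It is not the authors' implementation: the element types, the default risk levels, the mapping from risk level to operation, the `[REDACTED]` placeholder, and all names (`Element`, `sanitize`, `DEFAULT_RISK`, `GENERAL_TERMS`) are assumptions made for illustration. The elements are supplied by hand rather than detected with NLP, and the final rephrasing by a language model is omitted.

```python
import random
from dataclasses import dataclass

# Hypothetical default risk level per element type (not the paper's values).
DEFAULT_RISK = {"PERSON": "high", "ORG": "high", "LOCATION": "medium", "NUMBER": "medium"}

# Hypothetical broader terms used for generalization.
GENERAL_TERMS = {"PERSON": "a person", "ORG": "an organization", "LOCATION": "a city"}


@dataclass
class Element:
    text: str  # surface string as it appears in the report
    kind: str  # element type, e.g. "PERSON" or "NUMBER"


def perturb_number(text, rng):
    """Perturbation: shift an integer by a small non-zero random offset."""
    offset = rng.choice([d for d in range(-5, 6) if d != 0])
    return str(int(text) + offset)


def sanitize(text, elements, overrides=None, rng=None):
    """Apply one operation per element, chosen by its (possibly overridden) risk level."""
    overrides = overrides or {}
    rng = rng or random.Random()
    for el in elements:
        # The user's override wins over the default risk for that element type.
        risk = overrides.get(el.text, DEFAULT_RISK.get(el.kind, "low"))
        if risk == "high":
            replacement = "[REDACTED]"  # suppression, shown as a visible placeholder
        elif risk == "medium":
            if el.kind == "NUMBER":
                replacement = perturb_number(el.text, rng)  # perturbation
            else:
                replacement = GENERAL_TERMS.get(el.kind, "[REDACTED]")  # generalization
        else:
            replacement = el.text  # low risk: keep the detail unchanged
        text = text.replace(el.text, replacement)
    return text


TEXT = "Alice Smith found 47 forged invoices at Acme Corp in Berlin."
ELEMENTS = [
    Element("Alice Smith", "PERSON"),
    Element("47", "NUMBER"),
    Element("Acme Corp", "ORG"),
    Element("Berlin", "LOCATION"),
]

# Default risk levels, then a run where the user lowers two risk levels by hand.
print(sanitize(TEXT, ELEMENTS))
print(sanitize(TEXT, ELEMENTS, overrides={"47": "low", "Acme Corp": "medium"}))
```

The `overrides` argument stands in for the interactive adjustment: a whistleblower who knows that the exact invoice count is essential to the report could mark it as low risk and keep it, while still hiding the names.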
Key insights extracted from
by Dimitri Stau... at arxiv.org 05-03-2024
https://arxiv.org/pdf/2405.01097.pdf