Core Concepts
Users can protect their data from large language models by embedding ghost sentences in their documents.
Abstract
Web user data is a crucial resource for pre-training large language models (LLMs).
Users can insert personal passphrases, called ghost sentences, into their documents to confirm whether LLMs were trained on their data.
Ghost sentences act as hidden guards within user documents, safeguarding data from unauthorized use.
Metrics such as document identification accuracy and user identification accuracy are used to assess how effective ghost sentences are.
Larger models and longer ghost sentences improve memorization performance.
Inserting ghost sentences in the latter half of a document enhances identification accuracy.
The choice of wordlist and the domain of the training data both affect how well ghost sentences are memorized.
Learning rate, training epochs, and model sizes influence the effectiveness of ghost sentences.
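The mechanics above can be sketched in a few lines: draw a random passphrase from a wordlist and splice it into the latter half of a document at a sentence boundary. This is a minimal illustration, not the paper's implementation; the sample wordlist, function names, and insertion heuristic are all hypothetical stand-ins.

```python
import random

# Hypothetical stand-in wordlist; real ghost sentences draw from much
# larger diceware-style wordlists.
WORDLIST = [
    "apple", "breeze", "copper", "dune", "ember", "frost",
    "gable", "harbor", "inlet", "juniper", "kestrel", "lagoon",
]

def make_ghost_sentence(num_words=8, rng=None):
    """Build a random passphrase to serve as a ghost sentence."""
    rng = rng or random.Random()
    return " ".join(rng.choice(WORDLIST) for _ in range(num_words))

def insert_ghost(document, ghost):
    """Insert the ghost sentence at a sentence boundary in the latter
    half of the document (reported to improve identification accuracy)."""
    midpoint = len(document) // 2
    cut = document.find(". ", midpoint)
    if cut == -1:
        # No sentence boundary found; append at the end instead.
        return document + " " + ghost + "."
    cut += 2  # move past the ". " separator
    return document[:cut] + ghost + ". " + document[cut:]
```

A user would keep the generated passphrase private and reuse it across their published documents, so later memorization can be traced back to them.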
Stats
11 out of 16 users with ghost sentences identified their data in the model's generated content.
61 out of 64 users with ghost sentences identified their data in the LLM's output.
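The user identification check behind these counts can be sketched as a verbatim membership test: a user is counted as identified if their passphrase surfaces in the model's generations. This is a hypothetical simplification of the paper's evaluation protocol; the function names and the aggregation rule are assumptions.

```python
def user_identified(generated_text, passphrase):
    """A user counts as identified if their unique passphrase appears
    verbatim in the model's generated continuation."""
    return passphrase in generated_text

def identification_accuracy(generations, passphrases):
    """Fraction of users whose passphrase surfaces in at least one
    generation. `generations` maps user -> list of sampled outputs,
    `passphrases` maps user -> that user's ghost sentence."""
    hits = sum(
        any(user_identified(text, phrase) for text in generations[user])
        for user, phrase in passphrases.items()
    )
    return hits / len(passphrases)
```

Document identification works analogously, counting documents rather than users whose embedded passphrase is reproduced.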