Fingerprinting Users Through the Topics API: A Privacy Analysis on Real Browsing Data


Core Concepts
The Topics API for the web, Google's proposed alternative to third-party cookies, does not provide the same privacy guarantees to all users. On a dataset of 1207 real users, 60% can be uniquely re-identified across websites after just 3 observations of their topics by advertisers.
Abstract
This paper provides a reproducible privacy analysis of the latest version of the Topics API on the largest publicly available dataset of real browsing histories.

Key findings:
Users' topics profiles are highly stable and unique: 47% of users have 3 or more topics in common in their top 5 across consecutive weeks, and 93% have a unique top 5 topic profile each week.
Some of the "noisy" topics that the API adds to provide plausible deniability can be identified, with 10% precision using a simple heuristic.
In practice, 46%, 55%, and 60% of the 1207 real users can be uniquely re-identified across 2 websites after 1, 2, and 3 observations of their topics by advertisers, respectively.

The paper highlights the importance of public and reproducible evaluations of new web proposals like the Topics API, so that potential limitations are identified during the design phase rather than after deployment. It calls on web actors to release anonymized or synthetic datasets to enable further analyses.
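The stability and uniqueness figures above can be illustrated with a minimal Python sketch. It assumes each user's weekly profile is available as a list of top 5 topic IDs and that the order of topics within a profile is ignored. The function names and the toy data are hypothetical and are not taken from the paper's code.

```python
from collections import Counter


def unique_profile_fraction(profiles):
    """Fraction of users whose top-topic set is shared with no other user.

    profiles: dict mapping a user ID to an iterable of topic IDs
    (assumed layout, order within a profile is ignored).
    """
    keys = {user: frozenset(topics) for user, topics in profiles.items()}
    if not keys:
        return 0.0
    counts = Counter(keys.values())
    unique = sum(1 for key in keys.values() if counts[key] == 1)
    return unique / len(keys)


def stable_profile_fraction(week_a, week_b, min_common=3):
    """Fraction of users sharing at least min_common topics across two weeks."""
    users = week_a.keys() & week_b.keys()
    if not users:
        return 0.0
    stable = sum(
        1 for user in users
        if len(set(week_a[user]) & set(week_b[user])) >= min_common
    )
    return stable / len(users)


# Toy data (hypothetical topic IDs), two consecutive weeks for four users.
week_1 = {
    "u1": [1, 2, 3, 4, 5],
    "u2": [1, 2, 3, 4, 5],
    "u3": [6, 7, 8, 9, 10],
    "u4": [1, 2, 3, 4, 11],
}
week_2 = {
    "u1": [1, 2, 3, 20, 21],
    "u2": [30, 31, 32, 33, 34],
    "u3": [6, 7, 40, 41, 42],
    "u4": [1, 2, 3, 4, 11],
}

print(unique_profile_fraction(week_1))          # 0.5 (u3 and u4 are unique)
print(stable_profile_fraction(week_1, week_2))  # 0.5 (u1 and u4 are stable)
```

On real data, the first function corresponds to the share of users with a unique weekly profile and the second to the share of users whose profile persists from one week to the next.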
Stats
46% of users can be uniquely re-identified across 2 websites after 1 observation of their topics
55% of users can be uniquely re-identified across 2 websites after 2 observations of their topics
60% of users can be uniquely re-identified across 2 websites after 3 observations of their topics
Quotes
"46%, 55%, and 60% of the 1207 real users can be uniquely re-identified across 2 websites after 1, 2, and 3 observations of their topics by advertisers, respectively."
"This paper highlights the importance of public and reproducible evaluations of any claim made by new web proposals and to identify the potential limitations of these techniques during their design rather than after their deployment."

Key Insights Distilled From

by Yohan Beugin... at arxiv.org 03-29-2024

https://arxiv.org/pdf/2403.19577.pdf
A Public and Reproducible Assessment of the Topics API on Real Data

Deeper Inquiries

What other techniques could be used to identify noisy topics returned by the Topics API and improve the plausible deniability it aims to provide?

Several additional techniques could help identify the noisy topics returned by the Topics API, and their success rate would show how much plausible deniability the noise actually provides. One approach is to measure how often each topic appears across users: topics whose frequency falls below a chosen threshold could be flagged as potentially noisy. Machine learning classifiers could learn patterns in topic distributions and label each topic as noisy or real. Natural Language Processing (NLP) techniques could analyze the semantic relationships between a user's topics and flag those that are inconsistent with the rest of the profile. Finally, collaborative filtering could compare the topics observed for a user with those observed for similar users, helping to separate genuine topics from noisy ones.
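The frequency-threshold idea can be sketched in a few lines of Python. This is only an illustration under stated assumptions: observations are given as a mapping from user ID to observed topic IDs, and the `min_share` threshold is a hypothetical parameter. It is not the heuristic evaluated in the paper.

```python
from collections import Counter


def flag_rare_topics(observations, min_share=0.01):
    """Return topics observed for fewer than min_share of all users.

    observations: dict mapping a user ID to a list of observed topic IDs
    (assumed layout). Each topic is counted at most once per user.
    """
    n_users = len(observations)
    users_per_topic = Counter()
    for topics in observations.values():
        users_per_topic.update(set(topics))
    return {
        topic
        for topic, count in users_per_topic.items()
        if count / n_users < min_share
    }


# Toy data: topic 3 is seen for only one of four users.
observed = {
    "u1": [1, 2],
    "u2": [1, 2],
    "u3": [1, 3],
    "u4": [1, 2],
}
print(flag_rare_topics(observed, min_share=0.5))  # {3}
```

A rare topic is not necessarily noise, since a genuine niche interest is flagged in the same way, so a threshold like this trades precision against recall.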

How would the privacy guarantees of the Topics API change if the witness requirement were enforced for API callers?

Enforcing the witness requirement for API callers would significantly strengthen the privacy guarantees of the Topics API. Under this requirement, a caller receives a real topic only if it was embedded on a website of that topic that the user visited during the relevant epochs. Advertisers would therefore get a more restricted view of users' browsing behavior, which limits the information disclosed to them and makes it harder to track individual users and re-identify them across websites.
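The filtering effect of the witness requirement can be sketched as follows. The data layout, the function name, and the default `window` of 3 epochs are assumptions made for this example; the snippet does not reproduce any browser's implementation.

```python
def filter_by_witness(top_topics, witnessed_by_epoch, current_epoch, window=3):
    """Keep only the user's topics that the caller itself witnessed recently.

    top_topics: the user's real top topics (list of topic IDs).
    witnessed_by_epoch: dict mapping an epoch number to the set of topics of
        the sites on which this caller was embedded when the user visited.
    window: number of most recent epochs considered (assumed value).
    """
    recent = set()
    for epoch in range(current_epoch - window + 1, current_epoch + 1):
        recent |= set(witnessed_by_epoch.get(epoch, ()))
    return [topic for topic in top_topics if topic in recent]


# The caller saw topic 1 too long ago (epoch 7) and never saw topics 2 or 4.
user_top_topics = [1, 2, 3, 4, 5]
caller_history = {7: {1}, 9: {3}, 10: {5, 99}}
print(filter_by_witness(user_top_topics, caller_history, current_epoch=10))
# [3, 5]
```

In this toy case the caller learns two of the user's five topics, which shows why the requirement shrinks the profile available for re-identification.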

How could the Topics API be redesigned to better balance the trade-off between utility for advertisers and privacy for users?

Several redesign strategies could better balance utility for advertisers against privacy for users.

One approach is to apply differential privacy techniques when adding noise to the topics returned to advertisers. Noise introduced in a controlled manner would protect individual users while still giving advertisers information that is useful for ad targeting.

Stronger transparency and control mechanisms would let users manage their topic preferences and opt out of certain categories, so that they can protect their privacy while still receiving relevant ads.

Encrypting user topics in transit and at rest, combined with secure data handling practices, would reduce the risk of data breaches and unauthorized access to sensitive information.

Finally, regular privacy impact assessments and audits would evaluate the effectiveness of these safeguards and identify areas for improvement, helping the Topics API maintain a balance between utility and privacy for all stakeholders.
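The differential privacy suggestion can be made concrete with k-ary randomized response, a standard local differential privacy mechanism. The text above does not name a mechanism, so this choice is an assumption. The taxonomy, the value of `epsilon`, and the function names are likewise hypothetical, and the true topic is assumed to belong to a taxonomy of at least two topics.

```python
import math
import random


def truth_probability(epsilon, k):
    """Probability of reporting the true topic under k-ary randomized response."""
    return math.exp(epsilon) / (math.exp(epsilon) + k - 1)


def randomized_topic(true_topic, taxonomy, epsilon, rng=random):
    """Report the true topic with probability e^eps / (e^eps + k - 1),
    otherwise a uniformly chosen different topic from the taxonomy.

    Assumes true_topic is in taxonomy and len(taxonomy) >= 2.
    """
    if rng.random() < truth_probability(epsilon, len(taxonomy)):
        return true_topic
    others = [topic for topic in taxonomy if topic != true_topic]
    return rng.choice(others)


# Hypothetical 10-topic taxonomy and a seeded generator for repeatability.
taxonomy = list(range(10))
rng = random.Random(0)
reports = [randomized_topic(3, taxonomy, epsilon=2.0, rng=rng) for _ in range(5)]
print(reports)
```

A smaller `epsilon` makes the reported topic less informative about the real one, which improves privacy at the cost of advertiser utility. At `epsilon = 0` every topic in the taxonomy is equally likely to be reported.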