Core Concepts
This paper presents GTT23, the first dataset of genuine Tor traces. It enables more realistic evaluation of website fingerprinting attacks and defenses than existing synthetic datasets do.
Abstract
The authors designed a safe Tor network measurement methodology to collect GTT23, a dataset of 13,900,621 genuine Tor circuit traces. GTT23 is an order of magnitude larger than any previous website fingerprinting (WF) dataset. Unlike existing synthetic datasets, it captures real users interacting with a diverse set of internet services at natural base rates.
The analysis of GTT23 reveals several key insights about genuine Tor traffic:
96% of circuits use ports 80, 8080, or 443 for the first connection, indicating that web traffic predominates
Most circuits are short, with a median of just 25 cells (< 10.5 KB), suggesting that many circuits may fail prematurely
Circuit lengths are highly variable: for over half of domains, the standard deviation of circuit length exceeds the mean
The distribution of circuits per domain follows a power law, with a long tail of rarely accessed sites
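The last two observations are per-domain statistics, and both can be computed by grouping circuit lengths by domain. The following is a minimal sketch, assuming traces are available as (domain, cell count) pairs. The function name, the toy data, and the choice of population standard deviation are assumptions made for illustration and are not taken from the paper or from any GTT23 tooling.

```python
from collections import defaultdict
from statistics import mean, pstdev


def domain_summary(traces):
    """Summarize per-domain circuit statistics.

    `traces` is an iterable of (domain, cells) pairs.
    Returns (high_variability_share, single_circuit_share):
      - share of domains whose circuit-length std dev exceeds the mean
      - share of domains observed with exactly one circuit
    """
    lengths = defaultdict(list)
    for domain, cells in traces:
        lengths[domain].append(cells)
    if not lengths:
        return 0.0, 0.0

    n_domains = len(lengths)
    # A domain with one circuit has std dev 0, so it never counts here.
    high_var = sum(1 for v in lengths.values() if pstdev(v) > mean(v))
    single = sum(1 for v in lengths.values() if len(v) == 1)
    return high_var / n_domains, single / n_domains


# Hypothetical toy data, not drawn from GTT23.
sample = [
    ("a.example", 5), ("a.example", 5), ("a.example", 400),
    ("b.example", 100), ("b.example", 110),
    ("c.example", 25),
]
high_var_share, single_share = domain_summary(sample)
```

In this toy sample only a.example mixes very short and very long circuits, so it is the only domain whose standard deviation exceeds its mean, and c.example is the only domain seen once.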
The authors compare GTT23 to 25 synthetic WF datasets published over the last 15 years. They find that existing datasets share common deficiencies, such as focusing only on web traffic, using simplistic user models and tools, and not reflecting realistic base rates. In contrast, GTT23 provides a more accurate representation of genuine Tor user behavior and website access patterns, enabling more meaningful evaluation of WF attacks and defenses.
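One way to see why base rates matter for evaluation is that a classifier's precision depends on how often the target site actually appears in the traffic. The sketch below applies Bayes' rule to show this. The true-positive rate, false-positive rate, and base rates are hypothetical values chosen for illustration and are not results from the paper.

```python
def precision(tpr, fpr, base_rate):
    """Probability that a positive prediction is correct (Bayes' rule).

    tpr:       true-positive rate of the classifier
    fpr:       false-positive rate of the classifier
    base_rate: fraction of traces that truly belong to the target site
    """
    true_pos = tpr * base_rate
    false_pos = fpr * (1.0 - base_rate)
    total = true_pos + false_pos
    return true_pos / total if total > 0 else 0.0


# Same hypothetical classifier, evaluated under two base rates.
balanced = precision(tpr=0.95, fpr=0.01, base_rate=0.5)    # balanced synthetic split
rare = precision(tpr=0.95, fpr=0.01, base_rate=0.001)      # target site is rarely visited
```

With these assumed rates, precision is about 0.99 under the balanced split but falls below 0.1 when the target site accounts for only 0.1% of traces.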
Stats
The median number of cells per circuit in GTT23 is 25, corresponding to at most 10.5 KB of application payload.
Over 90% of webpages have a transfer size greater than 450 KB according to the HTTP Archive.
The distribution of circuits per domain in GTT23 follows a power law, with 80% of domains having just a single circuit measured.
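The first two figures above can be set side by side to show how far a median circuit falls short of a typical page load. This is a minimal sketch of the arithmetic, using only the numbers quoted in this summary. The constant names are illustrative.

```python
# Figures quoted in this summary.
MEDIAN_CIRCUIT_MAX_KB = 10.5  # upper bound on payload of the median circuit (25 cells)
PAGE_TRANSFER_MIN_KB = 450.0  # over 90% of webpages exceed this transfer size

# How many median-sized circuits it would take to carry one such page.
size_gap = PAGE_TRANSFER_MIN_KB / MEDIAN_CIRCUIT_MAX_KB
```

Because 10.5 KB is an upper bound and 450 KB is a lower bound, the resulting factor of roughly 43 is a conservative estimate of the gap.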
Quotes
"GTT23 contains genuine traces of websites accessed by real Tor users at completely natural base rates, and thus it will enable WF evaluations that more accurately estimate real-world WF performance."
"Because GTT23 contains genuine traces, it is more realistic than any existing synthetic dataset, and thus it will enable WF evaluations that more accurately estimate real-world WF performance."