Comprehensive Analysis of Over 1 Million Web Privacy Policies: Insights into Online Privacy Practices at Scale
The PrivaSeer Corpus is a dataset of over 1 million contemporary website privacy policies that enables large-scale analysis of online privacy practices. The corpus was built with a novel pipeline comprising web crawling, language detection, document classification, and near-duplicate removal. Analysis of the corpus yields insights into the distribution of privacy policy topics, the relationship between domain popularity and policy coverage, and the readability of privacy policies.
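To make the near-duplicate removal step concrete, the following is a minimal sketch that assumes word-level shingling with a Jaccard similarity threshold. The abstract does not say which technique the PrivaSeer pipeline uses, so the function names, the shingle size `k`, and the `threshold` value here are all hypothetical and chosen only for illustration.

```python
import re


def shingles(text: str, k: int = 3) -> frozenset:
    """Return the set of k-word shingles of a text (lowercased, punctuation dropped)."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    if not words:
        return frozenset()
    if len(words) < k:
        # Text shorter than one shingle: treat the whole text as a single shingle.
        return frozenset([tuple(words)])
    return frozenset(tuple(words[i:i + k]) for i in range(len(words) - k + 1))


def jaccard(a: frozenset, b: frozenset) -> float:
    """Jaccard similarity |a & b| / |a | b|; two empty sets count as identical."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(docs, threshold: float = 0.8, k: int = 3):
    """Return indices of documents to keep, dropping any document whose
    similarity to an already-kept document reaches the threshold.

    Assumption: threshold and k are illustrative values, not taken from the source.
    """
    kept = []  # list of (original index, shingle set)
    for i, doc in enumerate(docs):
        s = shingles(doc, k)
        if all(jaccard(s, t) < threshold for _, t in kept):
            kept.append((i, s))
    return [i for i, _ in kept]


if __name__ == "__main__":
    policies = [
        "We collect your email address and share it with partners for marketing purposes.",
        "We collect your email address and share it with partners for marketing purposes!",
        "Cookies are used to remember your preferences across sessions.",
    ]
    print(remove_near_duplicates(policies))
```

This sketch compares every document against every kept document, which is quadratic in the worst case. At the scale of a million documents, an approximate scheme such as MinHash with locality-sensitive hashing is a common way to avoid all-pairs comparison.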