Core Concepts
Web publishers are increasingly seeking ways to keep their content out of the training datasets of generative AI models and thereby safeguard their intellectual property. New technical standards and opt-out protocols are emerging to give publishers finer-grained control over how their data is used by AI applications.
Abstract
This paper provides a comprehensive overview of the legal and technical landscape surrounding web content control for generative AI.
The legal background section outlines the key intellectual property and data protection regulations in the EU and US that govern the use of web content, particularly in the context of text and data mining (TDM) activities. The EU's 2019 DSM Directive provides the central regulatory framework: rightsholders may reserve their rights against certain TDM uses, provided the reservation is expressed in an appropriate, machine-readable manner.
The technical background section introduces the Robots Exclusion Protocol (REP) as the dominant mechanism for content control on the web, along with various past and present initiatives to extend or refine it, such as ACAP, RightsML, and C2PA. These existing standards, however, fall short of the specific needs of web publishers confronted with advances in generative AI.
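For readers unfamiliar with REP, the short sketch below shows how a robots.txt file is interpreted in practice, using Python's standard-library robotparser; the site, paths, and "ExampleBot" agent name are hypothetical placeholders, not part of the paper's material.

```python
# Minimal illustration of the Robots Exclusion Protocol (REP):
# a hypothetical robots.txt is parsed with Python's standard-library
# robotparser, and a crawler checks whether it may fetch a URL.
import urllib.robotparser

# Example robots.txt content (hypothetical site, paths, and agent).
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/

User-agent: ExampleBot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Generic crawlers may fetch public pages but not /private/.
print(parser.can_fetch("*", "https://example.com/articles/1"))  # True
print(parser.can_fetch("*", "https://example.com/private/x"))   # False

# "ExampleBot" is disallowed from the entire site.
print(parser.can_fetch("ExampleBot", "https://example.com/"))   # False
```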
The paper then evaluates six recent ad hoc standards that have emerged to give web publishers an opt-out mechanism: enhancements to robots.txt, usage-specific user agents, the learners.txt file, new meta tags, image metadata, and the TDM Reservation Protocol. These proposals are assessed along four dimensions: technique, level of granularity, associated terms, and scope of the opt-out.
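To make these mechanisms concrete, the following sketch illustrates three of the evaluated surfaces as they might appear in the wild: AI-specific user agents in robots.txt, a "noai"-style meta tag, and a TDM Reservation Protocol (TDMRep) meta tag, checked with Python's standard library. The agent names, tag spellings, and HTML snippet are illustrative assumptions rather than an authoritative or complete rendering of each proposal.

```python
# Hedged sketch of three opt-out surfaces: (1) usage-specific user agents
# in robots.txt, (2) "noai"-style meta tags, (3) a TDMRep meta tag.
# All names and values below are illustrative assumptions.
import urllib.robotparser
from html.parser import HTMLParser

# (1) robots.txt addressing AI-specific crawlers by user agent.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: Google-Extended
Disallow: /
"""
rp = urllib.robotparser.RobotFileParser()
rp.parse(ROBOTS_TXT.splitlines())
print(rp.can_fetch("GPTBot", "https://example.com/article"))  # False

# (2) + (3) meta tags expressing AI/TDM opt-outs in a page's <head>.
HTML = """
<head>
  <meta name="robots" content="noai, noimageai">
  <meta name="tdm-reservation" content="1">
</head>
"""

class MetaCollector(HTMLParser):
    """Collects <meta name=... content=...> pairs from an HTML document."""
    def __init__(self):
        super().__init__()
        self.meta = {}
    def handle_starttag(self, tag, attrs):
        if tag == "meta":
            d = dict(attrs)
            if "name" in d:
                self.meta[d["name"].lower()] = d.get("content", "")

collector = MetaCollector()
collector.feed(HTML)
print("noai" in collector.meta.get("robots", ""))    # True
print(collector.meta.get("tdm-reservation") == "1")  # True
```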
An empirical study is also presented, which examines the current adoption rates of these various opt-out approaches across a large sample of websites. The findings suggest that while some initiatives, such as usage-specific user agents, have seen notable adoption, the overall landscape remains fragmented, with most ad hoc standards still struggling to gain widespread traction among web publishers.
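For illustration only, a measurement of this kind could in principle be approximated by fetching each site's robots.txt and testing which AI-related user agents are disallowed, as in the rough sketch below. The domain sample and agent list are placeholders, and this is not the paper's actual measurement pipeline.

```python
# Rough sketch of scanning a domain sample for AI-crawler opt-outs:
# download each site's robots.txt and check which AI-related user
# agents are barred from the site root. Agents and domains are placeholders.
import urllib.robotparser

AI_AGENTS = ["GPTBot", "Google-Extended", "CCBot"]  # assumed agent names
DOMAINS = ["example.com", "example.org"]            # placeholder sample

def blocked_agents(domain):
    """Return which of AI_AGENTS are disallowed from the site root."""
    rp = urllib.robotparser.RobotFileParser()
    rp.set_url(f"https://{domain}/robots.txt")
    try:
        rp.read()                 # fetch and parse the live robots.txt
    except OSError:
        return []                 # unreachable site: treat as no opt-out
    return [a for a in AI_AGENTS if not rp.can_fetch(a, f"https://{domain}/")]

for domain in DOMAINS:
    print(domain, blocked_agents(domain))
```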
The paper concludes that the current technical solutions are either idealistic and poorly adopted, or specific to certain AI applications, leaving webmasters with the overwhelming task of implementing multiple micro-standards to protect their data from unwanted AI/ML-related use. Developing a more comprehensive and widely accepted standard remains a key open challenge in this evolving landscape of web content control.