Core Concepts
The core contribution of this paper is a reinforcement learning framework that learns policies which maximize reward while minimizing the disclosure of sensitive state variables through the agent's actions.
Abstract
The paper introduces a reinforcement learning (RL) framework for learning policies that maximize reward while minimizing the disclosure of sensitive state variables through the agent's actions. The key ideas are:
Formulate the problem as a constrained optimization problem: maximize reward subject to a constraint on the mutual information between the agent's actions and the sensitive state variables (written out after this list).
Develop several gradient estimators to optimize this constrained objective efficiently, including a model-based estimator, a model-free upper-bound estimator, and a reparameterization-based estimator for differentiable environments; a toy computation of the mutual-information term they estimate appears below.
Demonstrate the effectiveness of the approach on a variety of tasks, including a tabular web connection problem, a 2D continuous control task, and high-dimensional simulated robotics tasks. The learned policies are able to effectively hide the sensitive state variables while maintaining high reward.
Compare the approach to differentially private RL and a previous mutual information regularization method, showing the advantages of the proposed mutual information constraint formulation.
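Written out, the constrained problem in the first item above takes roughly the following form; the per-timestep budget \epsilon and the Lagrangian relaxation with multiplier \lambda are reconstructions consistent with the regularizer quoted below, not verbatim from the paper:

\max_{q} \; \mathbb{E}_{q}\!\Big[\sum_{t} r(s_t, a_t)\Big] \quad \text{subject to} \quad I(a_t; u_t) \le \epsilon \;\; \text{for all } t,

which is typically optimized through the Lagrangian

\max_{q} \; \mathbb{E}_{q}\!\Big[\sum_{t} r(s_t, a_t)\Big] \;-\; \lambda \sum_{t} I(a_t; u_t).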
The key insight is that by directly constraining the mutual information between actions and the sensitive state, the agent learns policies that plan ahead to reduce information disclosure, rather than simply adding noise. This makes it possible to achieve high reward while keeping disclosure within the privacy constraint.
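To make the regularized quantity concrete, below is a minimal tabular sketch of the action–sensitive-state mutual information that the constraint bounds. This is an illustration under assumed names and toy numbers (the policy table pi, the prior p_u, and the two example policies are not from the paper), not the paper's estimator:

```python
import numpy as np

def action_sensitive_mi(pi, p_u):
    """Exact I(A; U) for a tabular policy at a single timestep.

    pi  : (n_u, n_a) array, pi[u, a] = probability of action a given
          sensitive state u (non-sensitive state already marginalized out).
    p_u : (n_u,) prior over the sensitive variable u.
    """
    p_a = p_u @ pi             # marginal: p(a) = sum_u p(u) pi(a | u)
    joint = p_u[:, None] * pi  # joint: p(u, a)
    with np.errstate(divide="ignore", invalid="ignore"):
        # Terms with pi(a | u) == 0 contribute nothing to the sum.
        log_ratio = np.where(pi > 0, np.log(pi / p_a[None, :]), 0.0)
    return float(np.sum(joint * log_ratio))

p_u = np.array([0.5, 0.5])
pi_leaky = np.array([[1.0, 0.0],    # action deterministically reveals u
                     [0.0, 1.0]])
pi_private = np.array([[0.5, 0.5],  # action distribution independent of u
                       [0.5, 0.5]])
print(action_sensitive_mi(pi_leaky, p_u))    # log 2 ~ 0.693 nats: full leak
print(action_sensitive_mi(pi_private, p_u))  # 0.0 nats: nothing disclosed
```

The paper's estimators target this same quantity in settings where the expectation cannot be enumerated, trading it off against reward via the constraint rather than driving it all the way to zero.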
Stats
The paper does not contain any explicit numerical data or statistics to support the key claims. The results are presented qualitatively through visualizations of the learned policies and their behavior.
Quotes
"We formulate this privacy-constrained RL problem as an optimization problem with an additional regularizer on the mutual information between a function of the action at and a function of the protected state ut at each timestep t, induced under the learned policy q."
"Optimizing this regularizer is not straightforward since it is distribution-dependent (unlike the reward), and involves marginalization over the non-sensitive state."
"Experiments show that our constrained optimization finds the optimal privacy-constrained policy in an illustrative tabular environment and hides sensitive state in a continuous control problem. Finally, we show that the reparameterized estimator can find policies which effectively hide the sensitive state in high-dimensional (simulated) robotics tasks."