Core Concepts
The goal is to learn a policy that maximizes the worst-case reward while satisfying a constraint on the worst-case utility, even under model mismatch between the training and real environments.
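In symbols, one common way to write this objective is the following sketch (the notation here is ours, not necessarily the paper's: the uncertainty set of transition models, the reward and utility value functions of a policy under a given model, and a constraint threshold b):

```latex
\max_{\pi} \; \min_{P \in \mathcal{P}} \; V_{r}^{\pi, P}
\qquad \text{subject to} \qquad
\min_{P \in \mathcal{P}} \; V_{c}^{\pi, P} \;\geq\; b
```

Here the inner minimizations over the uncertainty set capture the worst-case reward and worst-case utility, so a feasible policy keeps the constraint satisfied for every model in the set, including the (unknown) real environment.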
Abstract
The paper addresses the challenge of constrained reinforcement learning (RL) under model mismatch. Existing constrained RL methods can obtain a well-performing policy in the training environment, but because of model mismatch, the same policy may violate constraints that were satisfied during training once it is deployed in the real environment.
To address this challenge, the authors formulate the problem as constrained RL under model uncertainty. The goal is to learn a policy that optimizes the reward while simultaneously satisfying the constraint under model mismatch.
The authors develop the Robust Constrained Policy Optimization (RCPO) algorithm, the first algorithm that applies to large/continuous state spaces and has theoretical guarantees on worst-case reward improvement and worst-case constraint violation at each iteration during training.
The key technical developments include:
A robust performance difference lemma that bounds the difference between the robust value functions of two policies.
A two-step approach that performs policy improvement followed by a projection step to ensure constraint satisfaction (see the sketch after this list).
An efficient approximation for practical implementation of the algorithm.
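The toy sketch below illustrates the improve-then-project idea on a small tabular problem. It is not the authors' implementation or their efficient approximation: here the uncertainty set is assumed to be a finite list of candidate transition kernels, the policy is a softmax table, gradients are taken by finite differences for brevity, and the projection is approximated by a first-order correction step. All function names (policy_eval, worst_case, rcpo_style_step, etc.) are ours.

```python
import numpy as np

def policy_eval(P, R, pi, gamma=0.9):
    """Exact policy evaluation for one transition kernel P and signal R."""
    S, A = R.shape
    P_pi = np.einsum('sa,sab->sb', pi, P)   # state-to-state kernel under pi
    r_pi = np.einsum('sa,sa->s', pi, R)     # expected per-state signal
    v = np.linalg.solve(np.eye(S) - gamma * P_pi, r_pi)
    return v.mean()                         # value under a uniform start

def worst_case(models, R, pi):
    """Robust (worst-case) value: minimum over the finite uncertainty set."""
    return min(policy_eval(P, R, pi) for P in models)

def softmax_policy(theta):
    z = np.exp(theta - theta.max(axis=1, keepdims=True))
    return z / z.sum(axis=1, keepdims=True)

def grad_fd(f, theta, eps=1e-4):
    """Finite-difference gradient of f(theta), for illustration only."""
    g = np.zeros_like(theta)
    for idx in np.ndindex(theta.shape):
        d = np.zeros_like(theta); d[idx] = eps
        g[idx] = (f(theta + d) - f(theta - d)) / (2 * eps)
    return g

def rcpo_style_step(theta, models, R_reward, R_utility, b,
                    lr_improve=0.5, lr_project=0.5):
    """One iteration: (1) improve the worst-case reward, then
    (2) project back toward the worst-case utility constraint if violated."""
    # Step 1: policy improvement on the worst-case reward objective.
    J_r = lambda th: worst_case(models, R_reward, softmax_policy(th))
    theta = theta + lr_improve * grad_fd(J_r, theta)

    # Step 2: projection -- if the worst-case utility falls below the
    # threshold b, move along its gradient to restore feasibility.
    J_c = lambda th: worst_case(models, R_utility, softmax_policy(th))
    if J_c(theta) < b:
        theta = theta + lr_project * grad_fd(J_c, theta)
    return theta
```

A usage sketch would build a handful of candidate transition kernels (e.g., perturbations of a nominal model), reward and utility tables of shape (states, actions), and call rcpo_style_step repeatedly. In practice the finite-difference gradients and exact policy evaluation would be replaced by sample-based estimates, which is where the paper's efficient approximation comes in.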
The authors demonstrate the effectiveness of their algorithm on a set of constrained RL tasks, in both tabular and deep RL settings.
Stats
The paper does not provide any specific numerical data or statistics. It focuses on the theoretical development of the RCPO algorithm and its performance guarantees.
Quotes
"The goal is to learn a good policy that optimizes the reward and at the same time satisfy the constraint under model mismatch."
"Our RCPO algorithm can be applied to large scale problems with a continuous state space."