The paper addresses the challenge of constrained reinforcement learning (RL) under model mismatch. Existing constrained RL methods can learn a policy that performs well in the training environment, but when that policy is deployed in a real environment, model mismatch may cause it to violate constraints that were satisfied during training.
To address this challenge, the authors formulate the problem as constrained RL under model uncertainty: the goal is to learn a policy that optimizes the reward while still satisfying the constraint despite model mismatch.
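One standard way to state such an objective (the exact formulation in the paper may differ; the uncertainty set $\mathcal{P}$, cost signal $c$, and budget $d$ below are assumptions) is to maximize the worst-case discounted reward over an uncertainty set of transition models while keeping the worst-case discounted cost within a budget:

$$
\max_{\pi} \; \min_{P \in \mathcal{P}} \; \mathbb{E}_{\pi, P}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} r(s_t, a_t)\Big]
\quad \text{s.t.} \quad
\max_{P \in \mathcal{P}} \; \mathbb{E}_{\pi, P}\!\Big[\sum_{t=0}^{\infty} \gamma^{t} c(s_t, a_t)\Big] \le d .
$$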
The authors develop a Robust Constrained Policy Optimization (RCPO) algorithm, the first algorithm that applies to large or continuous state spaces and provides theoretical guarantees on worst-case reward improvement and constraint violation at each iteration during training.
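The paper's algorithm is not reproduced here, but the worst-case quantities its guarantees refer to can be illustrated with a minimal, self-contained sketch: evaluating a fixed policy's worst-case reward and cost returns over a small, finite uncertainty set of transition models in a toy 2-state MDP. All numbers, the uniform initial distribution, and the finite uncertainty set are assumptions for illustration only.

```python
# Illustration only (not the paper's RCPO algorithm): worst-case reward/cost
# evaluation of a fixed policy over a small uncertainty set of transition models.
import numpy as np

gamma = 0.9
n_states, n_actions = 2, 2

# A fixed stochastic policy: pi[s, a] = probability of action a in state s.
pi = np.array([[0.7, 0.3],
               [0.4, 0.6]])

# Reward and cost for each (state, action) pair; the cost models the constraint signal.
reward = np.array([[1.0, 0.0],
                   [0.0, 2.0]])
cost = np.array([[0.1, 0.5],
                 [0.3, 0.0]])

def policy_return(P, signal):
    """Discounted return of pi under transition model P[s, a, s'] for a given signal."""
    # Induced state-to-state transition matrix and per-state expected signal.
    P_pi = np.einsum("sa,sat->st", pi, P)
    r_pi = np.einsum("sa,sa->s", pi, signal)
    # Solve (I - gamma * P_pi) v = r_pi, then average over a uniform initial distribution.
    v = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)
    return v.mean()

# A toy uncertainty set: the nominal model plus two perturbed transition models.
nominal = np.array([[[0.9, 0.1], [0.2, 0.8]],
                    [[0.5, 0.5], [0.1, 0.9]]])
perturbed1 = nominal.copy(); perturbed1[0, 0] = [0.7, 0.3]
perturbed2 = nominal.copy(); perturbed2[1, 1] = [0.4, 0.6]
uncertainty_set = [nominal, perturbed1, perturbed2]

# Worst case over the uncertainty set: the reward we can guarantee, and the largest
# constraint cost the policy might incur under model mismatch.
worst_reward = min(policy_return(P, reward) for P in uncertainty_set)
worst_cost = max(policy_return(P, cost) for P in uncertainty_set)
print(f"worst-case reward return: {worst_reward:.3f}")
print(f"worst-case cost return:   {worst_cost:.3f}")
```

A robust constrained method must certify improvement of the first quantity and feasibility of the second at every update, which is what the per-iteration guarantees described above concern.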
The key technical developments include:
The authors demonstrate the effectiveness of their algorithm on a set of constrained RL tasks, in both tabular and deep RL settings.