Core Concept
This research paper introduces DOPE+, a novel algorithm for safe reinforcement learning in constrained Markov decision processes (CMDPs), which achieves an improved regret upper bound while guaranteeing no constraint violation during the learning process.
Key Statistics
The DOPE+ algorithm achieves a regret upper bound of Õ((C̄ − C̄_b)⁻¹ H^2.5 √(S²AK)), where S and A denote the numbers of states and actions, H the episode horizon, and K the number of episodes.
DOPE+ guarantees zero hard constraint violation in every episode.
When C̄ − C̄_b = Ω(H), the regret upper bound nearly matches the lower bound of Ω(H^1.5 √(SAK)).