Basic Concepts
This research paper introduces DOPE+, a novel algorithm for safe reinforcement learning in constrained Markov decision processes (CMDPs), which achieves an improved regret upper bound while guaranteeing no constraint violation during the learning process.
Statistics
The DOPE+ algorithm achieves a regret upper bound of $\tilde{O}\big((\bar{C} - \bar{C}_b)^{-1} H^{2.5} \sqrt{S^2 A K}\big)$.
DOPE+ guarantees zero hard constraint violation in every episode.
When $\bar{C} - \bar{C}_b = \Omega(H)$, the regret upper bound nearly matches the lower bound of $\Omega(H^{1.5} \sqrt{S A K})$.
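To make "nearly matches" concrete, the sketch below evaluates the leading-order terms of the two bounds above and compares them when the gap between C̄ and C̄_b equals H. It is only an illustration: the function names and the numeric values of H, S, A, and K are hypothetical, and the constants and logarithmic factors hidden by the Õ and Ω notation are ignored.

```python
import math


def regret_upper_bound(H, S, A, K, gap):
    """Leading-order term of the stated upper bound.

    gap stands for C-bar minus C-bar_b. Constants and log factors
    hidden by the O-tilde notation are dropped (an assumption of
    this sketch, not something the source quantifies).
    """
    if gap <= 0:
        raise ValueError("gap must be positive")
    return (1.0 / gap) * H**2.5 * math.sqrt(S**2 * A * K)


def regret_lower_bound(H, S, A, K):
    """Leading-order term of the stated lower bound, constants dropped."""
    return H**1.5 * math.sqrt(S * A * K)


# Hypothetical problem sizes, chosen only for illustration.
H, S, A, K = 10, 20, 5, 10_000

# Set the gap to H, the smallest order allowed by the Omega(H) condition.
ratio = regret_upper_bound(H, S, A, K, gap=H) / regret_lower_bound(H, S, A, K)

# The two printed values agree up to floating-point rounding.
print(ratio, math.sqrt(S))
```

Under these simplifications, substituting a gap of H into the upper bound leaves H^1.5 · S · √(AK), so the ratio to the lower bound is √S regardless of H, A, and K. In other words, the remaining difference between the two leading terms is a factor of √S, plus whatever logarithmic factors the Õ notation hides.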