This study addresses online learning in constrained Markov decision processes (CMDPs) with adversarial losses and stochastic hard constraints. It considers two scenarios: in the first, the goal is to achieve sublinear regret together with sublinear cumulative positive constraint violation; in the second, the constraints must be satisfied at every episode. The proposed algorithms, BV-OPS and S-OPS, handle adversarial losses alongside hard constraints in non-stationary environments. The setting broadens the range of real-world applications to which such algorithms can be applied.
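To make the term "cumulative positive constraint violation" concrete, the following minimal sketch contrasts it with a plain signed sum of violations. The function names, the per-episode cost values, and the threshold are all hypothetical and chosen only for illustration; they are not the paper's notation or data. The point is that under the positive version, episodes that satisfy the constraint with slack cannot cancel out episodes that violate it.

```python
def cumulative_signed_violation(costs, threshold):
    """Sum of (cost - threshold) over episodes.

    Episodes with slack (cost < threshold) contribute negative terms,
    so they can offset violations from other episodes.
    """
    return sum(c - threshold for c in costs)


def cumulative_positive_violation(costs, threshold):
    """Sum of max(cost - threshold, 0) over episodes.

    Only the violating part of each episode is counted, so no
    cancellation between episodes is possible.
    """
    return sum(max(c - threshold, 0.0) for c in costs)


# Hypothetical per-episode expected constraint costs (illustrative only).
episode_costs = [0.75, 0.25, 0.75, 0.25]
threshold = 0.5  # hypothetical constraint threshold

signed = cumulative_signed_violation(episode_costs, threshold)
positive = cumulative_positive_violation(episode_costs, threshold)

print("signed violation:  ", signed)
print("positive violation:", positive)
```

In this toy sequence the learner alternates between violating and over-satisfying the constraint. The signed sum is zero, while the positive sum grows with every violating episode, which is why the positive metric is the stricter requirement.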
Source: arxiv.org