Key Concepts
This paper proposes a novel Value-at-Risk (VaR) based dynamic programming framework for optimizing a tight lower bound on the percentile criterion in offline reinforcement learning, without explicitly constructing ambiguity sets. The authors show that the VaR Bellman operator implicitly constructs smaller ambiguity sets than Bayesian credible regions, yielding less conservative robust policies.
Summary
The paper addresses the challenge of computing robust policies for high-stakes decision-making problems in reinforcement learning when data is limited. The authors focus on the percentile criterion, which seeks a policy that performs well under the worst α-percentile of transition probability models.
The key insights are:
Existing work approximately solves the non-convex percentile criterion via Robust Markov Decision Processes (RMDPs) with Bayesian credible regions (BCRs) as ambiguity sets. However, BCR ambiguity sets are often unnecessarily large, resulting in overly conservative policies.
The authors propose a novel Value-at-Risk (VaR) based dynamic programming framework that optimizes a lower bound on the percentile criterion without explicitly constructing ambiguity sets. They show that the VaR Bellman operator is a contraction mapping and that it optimizes a tighter lower bound on the percentile criterion than RMDPs with BCR ambiguity sets do.
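One plausible form of the VaR Bellman operator described above is sketched below; the notation is an assumption for illustration, not taken verbatim from the paper:

```latex
(T_{\mathrm{VaR}} V)(s) \;=\; \max_{a \in \mathcal{A}} \; \mathrm{VaR}_{\alpha}^{\,p \sim f_{s,a}} \Big[ \, p^{\top} \big( r_{s,a} + \gamma V \big) \, \Big]
```

Here $f_{s,a}$ denotes the posterior distribution over the transition probability vector $p$ for state-action pair $(s,a)$, so the backup takes the $\alpha$-quantile of the Bellman target over posterior transition models rather than a worst case over an explicit ambiguity set.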
The authors theoretically analyze the performance of the VaR framework and show that the ambiguity sets implicitly constructed by the VaR Bellman operator tend to be smaller than the BCR ambiguity sets, especially as the number of states increases.
The authors provide a Generalized VaR Value Iteration algorithm and analyze its error bounds. They also empirically demonstrate the efficacy of the VaR framework in three domains, showing that it outperforms various baseline methods in robust performance.
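A quantile-over-posterior-samples backup of this kind can be sketched as follows. This is an illustrative approximation assuming access to posterior samples of the transition model, not the authors' Generalized VaR Value Iteration algorithm; the function name and the sample-quantile approximation of VaR are assumptions:

```python
import numpy as np

def var_value_iteration(P_samples, R, gamma=0.95, alpha=0.1,
                        tol=1e-6, max_iter=1000):
    """Approximate VaR-based value iteration from posterior samples.

    P_samples: array (K, S, A, S) of K sampled transition models.
    R: array (S, A) of rewards.
    alpha: risk level; the backup takes the alpha-quantile (a sample
    estimate of VaR) of the Bellman targets across the K samples.
    """
    K, S, A, _ = P_samples.shape
    V = np.zeros(S)
    for _ in range(max_iter):
        # Bellman targets under each sampled model: shape (K, S, A)
        targets = R[None, :, :] + gamma * np.einsum('ksat,t->ksa',
                                                    P_samples, V)
        # VaR-style backup: alpha-quantile across samples,
        # then greedy maximization over actions
        Q_var = np.quantile(targets, alpha, axis=0)   # (S, A)
        V_new = Q_var.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return V_new
        V = V_new
    return V
```

Because each sampled Bellman target is γ-Lipschitz in V under the sup-norm, and quantile and max preserve this property, the iteration converges to a unique fixed point, mirroring the contraction argument in the paper.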
Statistics
The number of states in the Riverswim MDP is 5.
The number of states in the Population Growth MDP is 50.
The number of states in the Inventory Management MDP is 30.