The paper addresses the challenge of computing robust policies for high-stakes reinforcement-learning problems where data is limited. The authors focus on the percentile criterion, which maximizes the return a policy is guaranteed to achieve with probability at least 1 − α under the Bayesian posterior over transition models, i.e., it optimizes against the worst α-percentile model.
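For concreteness, the percentile criterion can be written as a chance-constrained program (notation reconstructed to match standard formulations; the paper's exact symbols may differ), where ρ(π, P) is the return of policy π under transition model P and f is the posterior over models:

```latex
\max_{\pi \in \Pi,\; y \in \mathbb{R}} \; y
\quad \text{s.t.} \quad
\Pr_{P^{\star} \sim f}\left[\, \rho(\pi, P^{\star}) \ge y \,\right] \ge 1 - \alpha
```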
The key insights are:
Existing work approximately solves the non-convex percentile criterion via Robust Markov Decision Processes (RMDPs) with Bayesian credible regions (BCRs) as ambiguity sets. However, because a BCR must contain the true model with high posterior probability, these ambiguity sets are often unnecessarily large, resulting in overly conservative policies.
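To illustrate the idea, here is a minimal sketch (not the paper's exact construction) of how a BCR-style L1 ambiguity set for a single state-action pair might be calibrated from posterior samples; the names `bcr_l1_radius` and `posterior_samples` are illustrative assumptions. The radius must cover a 1 − α fraction of the posterior mass, which tends to make it large:

```python
import numpy as np

def bcr_l1_radius(posterior_samples: np.ndarray, alpha: float) -> float:
    """Smallest L1 ball around the posterior-mean model that contains a
    1 - alpha fraction of sampled transition distributions p(. | s, a).

    posterior_samples: (n_samples, n_states) array; each row is one sampled
        next-state distribution for a fixed state-action pair.
    """
    nominal = posterior_samples.mean(axis=0)                 # posterior-mean model
    dists = np.abs(posterior_samples - nominal).sum(axis=1)  # L1 distance to nominal
    return float(np.quantile(dists, 1.0 - alpha))            # credible radius

# Example with a Dirichlet posterior over three next states.
rng = np.random.default_rng(0)
samples = rng.dirichlet(alpha=[2.0, 1.0, 1.0], size=5000)
print(bcr_l1_radius(samples, alpha=0.1))
```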
The authors propose a novel Value-at-Risk (VaR) based dynamic programming framework that optimizes a lower bound on the percentile criterion without explicitly constructing ambiguity sets. They show that the VaR Bellman operator is a contraction mapping and that it optimizes a tighter lower bound on the percentile criterion than RMDPs with BCR ambiguity sets.
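A sample-based reading of the operator: for each state-action pair, back up the α-quantile (the VaR) of the one-step return over posterior samples of the transition model, then maximize over actions. The sketch below is an assumption about how such a backup could be implemented, not the paper's code; names like `var_bellman_backup` and `P_samples` are illustrative:

```python
import numpy as np

def var_bellman_backup(v, P_samples, rewards, gamma, alpha):
    """One application of a sample-based VaR Bellman operator.

    v:         (S,) current value estimate
    P_samples: (K, S, A, S) K posterior samples of the transition model
    rewards:   (S, A) expected rewards
    Returns backed-up values (S,) and a greedy policy (S,).
    """
    # Q-values under every sampled model: (K, S, A)
    q_samples = rewards[None, :, :] + gamma * np.einsum("ksan,n->ksa", P_samples, v)
    # VaR at level alpha: the lower alpha-quantile over posterior samples,
    # i.e. a value the true model's Q exceeds with probability ~ 1 - alpha.
    q_var = np.quantile(q_samples, alpha, axis=0)            # (S, A)
    return q_var.max(axis=1), q_var.argmax(axis=1)
```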
Theoretical analysis of the VaR framework shows that the ambiguity sets implicitly constructed by the VaR Bellman operator tend to be smaller than the BCR ambiguity sets, with the gap widening as the number of states grows.
The authors provide a Generalized VaR Value Iteration algorithm and analyze its error bounds. They also empirically demonstrate the efficacy of the VaR framework in three domains, showing that it outperforms various baseline methods in terms of robust performance.
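Iterating the backup until the sup-norm change is small yields a fixed point, by the contraction property noted above. The loop below is a minimal sketch of how such a value iteration could be organized (the paper's Generalized VaR Value Iteration may differ in details such as the quantile level used per iteration):

```python
def var_value_iteration(P_samples, rewards, gamma, alpha, tol=1e-6, max_iter=1000):
    """Iterate the VaR backup to (approximate) convergence."""
    v = np.zeros(P_samples.shape[1])                 # one value per state
    policy = None
    for _ in range(max_iter):
        v_next, policy = var_bellman_backup(v, P_samples, rewards, gamma, alpha)
        if np.max(np.abs(v_next - v)) < tol:         # sup-norm stopping rule
            return v_next, policy
        v = v_next
    return v, policy
```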
Key insights distilled from: Elita A. Lob... at arxiv.org, 04-09-2024, https://arxiv.org/pdf/2404.05055.pdf