Core Concepts
Establishing minimax optimal sample complexity for learning ε-optimal policies in average-reward Markov decision processes based on the span of the bias function.
Abstract
This content examines the sample complexity of learning ε-optimal policies in average-reward Markov decision processes (MDPs) under a generative model. The study establishes a complexity bound of Õ(SAH/ε²), where H is the span of the bias function of the optimal policy and SA is the cardinality of the state-action space. The results improve upon existing work by being minimax optimal in all parameters S, A, H, and ε. The article discusses reductions from average-reward MDPs to discounted MDPs and presents algorithms with variance-dependent guarantees for solving the latter efficiently.
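To make the reduction concrete, here is a minimal sketch, not the article's exact algorithm: it picks the discount factor so the effective horizon 1/(1−γ) scales as H/ε, builds an empirical model from generative-model samples, and runs value iteration on the resulting discounted MDP. The oracle sample_next_state(s, a), the reward table R, and the parameters n_per_sa and vi_iters are all hypothetical names introduced here for illustration.

```python
import numpy as np

def solve_via_discounted_reduction(sample_next_state, R, S, A, H, eps,
                                   n_per_sa=1000, vi_iters=2000):
    """Sketch of the average-reward-to-discounted reduction pattern.

    sample_next_state(s, a) -> s'  hypothetical generative-model oracle
    R[s, a]                        known rewards in [0, 1]
    H                              span of the optimal bias function
    """
    # Choose gamma so the effective horizon 1/(1 - gamma) ~ H / eps;
    # intuitively, a horizon much longer than the bias span makes the
    # discounted and average-reward objectives differ by only O(eps).
    gamma = 1.0 - eps / max(H, eps)

    # Build an empirical transition model from generative-model samples.
    P_hat = np.zeros((S, A, S))
    for s in range(S):
        for a in range(A):
            for _ in range(n_per_sa):
                P_hat[s, a, sample_next_state(s, a)] += 1.0
            P_hat[s, a] /= n_per_sa

    # Value iteration on the empirical discounted MDP.
    V = np.zeros(S)
    for _ in range(vi_iters):
        Q = R + gamma * P_hat @ V   # Q[s, a]; (S, A, S) @ (S,) -> (S, A)
        V = Q.max(axis=1)

    # Return the policy that is greedy with respect to the final Q-values.
    return (R + gamma * P_hat @ V).argmax(axis=1)
```

The sample sizes here are placeholders; the paper's contribution is precisely that the right calibration of this horizon-versus-accuracy trade-off yields the minimax optimal total sample count.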
Structure:
Abstract and Introduction: Discusses reinforcement learning paradigms and theoretical challenges in RL.
Data Extraction: Key metrics supporting the sample complexity analysis are provided.
Quotations: Striking quotes supporting the key arguments are included.
Inquiry and Critical Thinking: Questions to deepen understanding and encourage critical thinking are posed.
Stats
Our result establishes a complexity bound of Õ(SAH/ε²).
Õ(SAH/ε²) samples suffice to learn an ε-optimal policy in weakly communicating MDPs under a generative model.
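Spelled out, and dividing the total bound across the SA state-action pairs (our arithmetic, not a quoted figure), the stated complexity reads:

```latex
\[
  n_{\text{total}} \;=\; \widetilde{O}\!\left(\frac{SAH}{\varepsilon^{2}}\right)
  \quad\text{i.e.}\quad
  \widetilde{O}\!\left(\frac{H}{\varepsilon^{2}}\right)
  \text{ samples from each state-action pair } (s,a).
\]
\]
```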
Quotes
Our result is based on reducing the average-reward MDP to a discounted MDP.
Our approach sheds greater light on the relationship between average-reward and discounted MDPs.