Core Concepts
Slowly changing adversarial bandit algorithms can efficiently handle discounted Markov decision processes.
Abstract
The paper develops a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits, focusing on slowly changing adversarial bandit algorithms. It covers the reduction itself, related work, preliminary concepts, the underlying assumptions, and the challenges that arise in the analysis. A main theorem and a case study with EXP3 demonstrate that such bandit algorithms handle discounted MDPs efficiently.
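As an illustrative sketch of the decentralized reduction (not the paper's exact construction: the feedback rule, step sizes, and rescaling below are all assumptions), one can picture an independent EXP3-style learner at each state, fed Q-style rewards built from running value estimates:

```python
import numpy as np

rng = np.random.default_rng(0)
S, A, gamma = 4, 3, 0.9

# Toy MDP: random transition kernel P[s, a] over next states and
# rewards R[s, a] in [0, 1]. Purely illustrative.
P = rng.dirichlet(np.ones(S), size=(S, A))
R = rng.uniform(size=(S, A))

class EXP3:
    """EXP3-style exponential-weights learner with importance-weighted
    reward estimates (no explicit exploration mixing, for brevity)."""
    def __init__(self, n_arms, eta):
        self.logw = np.zeros(n_arms)  # log-weights, for numerical stability
        self.eta = eta

    def probs(self):
        w = np.exp(self.logw - self.logw.max())
        return w / w.sum()

    def act(self):
        self.p = self.probs()
        self.arm = rng.choice(len(self.p), p=self.p)
        return self.arm

    def update(self, reward):
        # Importance-weighted (unbiased) estimate; reward assumed in [0, 1].
        self.logw[self.arm] += self.eta * reward / self.p[self.arm]

learners = [EXP3(A, eta=0.05) for _ in range(S)]  # one learner per state
V = np.zeros(S)    # running value estimate per state
alpha = 0.1        # value-tracking step size (arbitrary choice)
s = 0
for t in range(20_000):
    a = learners[s].act()
    s_next = rng.choice(S, p=P[s, a])
    # Q-style feedback: immediate reward plus discounted next-state value,
    # rescaled by (1 - gamma) so it stays in [0, 1].
    q = R[s, a] + gamma * V[s_next]
    learners[s].update((1 - gamma) * q)
    V[s] += alpha * (q - V[s])
    s = s_next

print("greedy action per state:",
      [int(np.argmax(l.probs())) for l in learners])
```

Each learner sees a reward signal that drifts as the other states' value estimates evolve, which is exactly why the slowly changing property of the bandit algorithm matters for the analysis.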
Key Insights
Under ergodicity and fast-mixing assumptions, one can plug slowly changing bandit algorithms into the reduction and achieve optimal regret bounds.
The regret bound scales with problem parameters, such as the numbers of states and actions, the discount factor, and mixing-time constants, that are specified in later sections.
The slowly changing property, meaning the learner's sampling distribution drifts only slightly between rounds, is crucial for leveraging techniques from the bandit toolbox; one possible formalization is sketched below.
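To make this property concrete, here is one way it might be formalized (an assumption on my part; the paper's exact definition and rates may differ). Writing $p_t$ for the learner's sampling distribution over arms at round $t$, the requirement is that consecutive distributions stay close in $\ell_1$ distance:

$$
\lVert p_{t+1} - p_t \rVert_1 \le \varepsilon_t, \qquad \varepsilon_t \to 0 .
$$

Intuitively, such stability means each per-state learner presents a nearly stationary environment to the others, which is what allows single-state bandit guarantees to be stitched into an MDP-wide regret bound.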
Quotes
"Reinforcement learning generalizes multi-armed bandit problems with additional difficulties of a longer planning horizon and unknown transition dynamics."
"We explore a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits."
"Despite the decentralized framework where each state is managed by an independent learner being a compelling problem in itself..."