Core Concepts
Slowly changing adversarial bandit algorithms, one per state, can efficiently solve discounted Markov decision processes.
Abstract
The article explores a reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits.
It discusses why RL is harder than MAB and the potential to close the complexity gap between the two settings.
The reduction places an independent bandit learner in each state (a minimal sketch follows this summary).
The slowly changing property of the bandit algorithms is crucial for achieving the optimal regret rate (formalized at the end of this page).
Techniques from the bandit toolbox are leveraged to handle the delayed feedback that arises when returns, rather than immediate rewards, are fed back to the learners (see the buffering sketch below).
The article connects the reduction to multi-agent RL and Monte Carlo methods.
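Concretely, the reduction can be pictured as one adversarial bandit sitting in each state. The sketch below is a minimal illustration under assumed interfaces, not the paper's exact algorithm: the `Exp3` learner, its learning rate, and an `env` with integer states exposing `reset()` and `step(action) -> (next_state, reward)` are all hypothetical stand-ins.

```python
import numpy as np

class Exp3:
    """Exp3-style adversarial bandit learner. A small, fixed learning rate
    also keeps the induced arm distribution slowly changing across rounds."""

    def __init__(self, n_arms, lr=0.01):
        self.n_arms = n_arms
        self.lr = lr
        self.log_weights = np.zeros(n_arms)

    def probs(self):
        w = np.exp(self.log_weights - self.log_weights.max())
        return w / w.sum()

    def pull(self, rng):
        return int(rng.choice(self.n_arms, p=self.probs()))

    def update(self, arm, reward):
        # Importance-weighted gain estimate: only the pulled arm is updated.
        self.log_weights[arm] += self.lr * reward / self.probs()[arm]


def run_reduction(env, n_states, n_actions, T, seed=0):
    """Place one independent bandit learner in each state; the learner at
    the visited state chooses the action and receives the feedback."""
    rng = np.random.default_rng(seed)
    learners = [Exp3(n_actions) for _ in range(n_states)]
    s = env.reset()
    for _ in range(T):
        a = learners[s].pull(rng)
        s_next, r = env.step(a)
        # Simplification: immediate reward as feedback. The paper's
        # construction feeds a discounted return instead, which is only
        # observed with delay (see the next sketch).
        learners[s].update(a, r)
        s = s_next
    return learners
```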
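The delayed-feedback aspect can be illustrated by buffering each decision until its truncated return has been observed. Again a hedged sketch: the H-step truncation, the (1 − γ) rescaling, and the learner interface are assumptions for illustration, reusing the per-state learners from the sketch above.

```python
import numpy as np
from collections import deque

def run_delayed_reduction(env, learners, gamma, H, T, seed=0):
    """Variant in which each learner is fed an H-step truncated discounted
    return, which only becomes available H steps after the decision."""
    rng = np.random.default_rng(seed)
    pending = deque()  # entries: (state, action, rewards collected so far)
    s = env.reset()
    for _ in range(T):
        a = learners[s].pull(rng)
        pending.append((s, a, []))
        s, r = env.step(a)
        for _, _, rewards in pending:
            rewards.append(r)
        if len(pending[0][2]) == H:
            # Oldest decision is now fully observed: deliver its delayed reward.
            s0, a0, rewards = pending.popleft()
            ret = sum(gamma ** k * rk for k, rk in enumerate(rewards))
            # Rescale to [0, 1] (assuming per-step rewards in [0, 1]).
            learners[s0].update(a0, ret * (1 - gamma))
    return learners
```

Because adversarial bandit algorithms are robust to this kind of delay, as the quote below notes, the buffering does not break the learners' guarantees.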
Stats
"We prove that, under ergodicity and fast mixing assumptions, one could trivially place ˜O(S)1 arbitrary slowly changing bandit algorithms to achieve a regret bound of ˜O(poly(S, A, H, τ, 1 β, 1 1−γ ) · ( √ T + cT T)) (which depends on various problem parameters specified in later sections), if the bandit learners are optimal in the adversarial bandit setting."
"The regret bound is optimal with respect to T (up to polylogarithmic factors) when cT is ˜O(1/ √ T), which is a mild requirement as discussed in later sections."
Quotes
"We show how our reduction framework effectively handles delayed feedback, benefiting from the robustness of adversarial bandits to such feedback."
"Understanding the reduction to independent learners can be connected to multi-agent RL, where such decentralization allows mitigating the curse of multiagency."
"Our analysis relies on using bandits in our algorithm that themselves are slowly changing."