Key Concepts
The paper studies the efficiency of slowly changing adversarial bandit algorithms in discounted Markov decision processes (MDPs), showing that they can achieve optimal regret. The approach is a black-box reduction from tabular reinforcement learning to multi-armed bandits, with one bandit instance placed in each state.
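To make the structure of the reduction concrete, here is a minimal sketch, assuming a toy tabular MDP and a simplified EXP3 learner: one bandit sits at each state, the bandit at the current state picks the action, and only that bandit receives the observed reward. The MDP, the reward model, and all parameter choices below are illustrative assumptions, and the discounted objective and the paper's exact feedback construction are not modeled; this only shows the per-state-bandit idea.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy tabular MDP (illustrative): S states, A actions, random dynamics.
S, A, T = 5, 3, 10_000
P = rng.dirichlet(np.ones(S), size=(S, A))  # P[s, a] = distribution over next states
R = rng.uniform(size=(S, A))                # R[s, a] = mean reward in [0, 1]

class Exp3:
    """Simplified EXP3; a small learning rate keeps its policy slowly changing."""
    def __init__(self, n_arms, eta):
        self.log_w = np.zeros(n_arms)  # log-weights, for numerical stability
        self.eta = eta

    def probs(self):
        w = np.exp(self.log_w - self.log_w.max())
        return w / w.sum()

    def pull(self):
        return rng.choice(len(self.log_w), p=self.probs())

    def update(self, arm, reward):
        # Importance-weighted estimate: only the pulled arm's weight moves.
        self.log_w[arm] += self.eta * reward / self.probs()[arm]

# The reduction: one independent bandit instance per state.
eta = np.sqrt(np.log(A) / (A * T))
bandits = [Exp3(A, eta) for _ in range(S)]

s, total = 0, 0.0
for _ in range(T):
    a = bandits[s].pull()               # the bandit sitting at state s acts
    r = float(rng.random() < R[s, a])   # Bernoulli reward with mean R[s, a]
    bandits[s].update(a, r)             # feedback goes only to that state's bandit
    total += r
    s = rng.choice(S, p=P[s, a])        # MDP transition

print(f"average reward over {T} steps: {total / T:.3f}")
```

Keeping a separate learner per state is what allows a bandit regret guarantee to be lifted to the MDP; as the quotes below note, the paper's analysis additionally requires each of these per-state bandits to be slowly changing.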
Abstract
The work applies slowly changing adversarial bandit algorithms to discounted MDPs via a reduction from reinforcement learning to bandits, addressing challenges such as objective mismatch and the need for sticky bandits. The analysis covers related work, the required assumptions, and a case study of the EXP3 algorithm.
The work highlights the role of exploration assumptions in MDPs and suggests improvements that could reduce the dependence on parameters such as the state space size and the effective horizon. Future directions include accommodating stochastic bandit algorithms and extending the framework to function approximation.
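The "slowly changing" requirement from the EXP3 case study can be illustrated informally: with a small learning rate, EXP3's action distribution shifts only slightly between consecutive rounds. Below is a minimal empirical probe, assuming the standard uniform-mixing variant of EXP3 and arbitrary bounded rewards; the paper's formal notion of slowly changing may be stated differently.

```python
import numpy as np

rng = np.random.default_rng(1)

def exp3_probs(log_w, mix):
    """Softmax of the log-weights, mixed with uniform exploration."""
    w = np.exp(log_w - log_w.max())
    q = w / w.sum()
    return (1 - mix) * q + mix / len(log_w)

n_arms, rounds = 10, 5000
for eta in (0.1, 0.01, 0.001):
    log_w, tv_sum = np.zeros(n_arms), 0.0
    for _ in range(rounds):
        p_old = exp3_probs(log_w, eta)
        a = rng.choice(n_arms, p=p_old)
        log_w[a] += eta * rng.random() / p_old[a]  # importance-weighted update
        p_new = exp3_probs(log_w, eta)
        tv_sum += 0.5 * np.abs(p_new - p_old).sum()
    print(f"eta={eta:<6} mean one-step TV shift: {tv_sum / rounds:.5f}")
```

Shrinking eta shrinks the average per-round total-variation movement of the action distribution, which is the behavior the reduction relies on.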
Statistics
Regret(T) = \tilde{O}\!\left( \frac{\tau^2 H^4 S(S+A)}{(1-\gamma)^2 \beta^3} \sqrt{T} \right)
R_{\text{fs-mdp}}(T) = \tilde{O}\!\left( \frac{H^{2.5} S f(A)}{(1-\gamma)\beta} \sqrt{T} + \frac{\tau^2 H^4 S(S+A)}{(1-\gamma)\beta^3} \, c_T T \right)
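In these bounds, S and A denote the numbers of states and actions, γ the discount factor, and T the number of rounds; H is the effective horizon mentioned above, and f(A) presumably captures the bandit algorithm's regret dependence on the number of arms. The remaining quantities τ, β, and c_T plausibly denote a mixing time, an exploration-related constant from the paper's assumptions, and the per-step drift rate of the slowly changing bandit algorithm, respectively.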
Quotes
"We explore a black-box reduction from discounted infinite-horizon tabular reinforcement learning to multi-armed bandits."
"Any slowly changing adversarial bandit algorithm achieving optimal regret in the adversarial setting can also attain optimal expected regret in infinite-horizon discounted MDPs."
"Our main result requires that the bandits placed in each state are slowly changing."