The author studies the efficiency of slowly changing adversarial bandit algorithms in discounted Markov decision processes (MDPs), showing that they achieve optimal regret. The approach rests on a reduction from tabular reinforcement learning to multi-armed bandits.
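To make the reduction concrete, here is a hedged sketch, not the paper's exact construction: each state of a tabular discounted MDP owns its own adversarial bandit instance (an Exp3-style learner with exploration mixing is assumed here), and a small learning rate keeps each per-state policy slowly changing, so the effective environment seen by every other bandit drifts slowly as well. All class and variable names (`Exp3`, `T`, `Q`) are illustrative, not from the paper.

```python
import math
import random

class Exp3:
    """Exp3-style adversarial bandit with uniform exploration mixing."""

    def __init__(self, n_actions, lr=0.05, eps=0.1):
        self.w = [0.0] * n_actions   # cumulative importance-weighted rewards
        self.lr = lr                 # small lr => slowly changing policy
        self.eps = eps               # exploration floor bounds probabilities
        self.n = n_actions

    def probs(self):
        m = max(self.w)
        e = [math.exp(x - m) for x in self.w]
        s = sum(e)
        return [(1 - self.eps) * x / s + self.eps / self.n for x in e]

    def sample(self, rng):
        r, acc = rng.random(), 0.0
        for a, p in enumerate(self.probs()):
            acc += p
            if r <= acc:
                return a
        return self.n - 1

    def update(self, action, reward):
        # importance-weighted update; reward must lie in [0, 1]
        self.w[action] += self.lr * reward / self.probs()[action]

# Toy 2-state, 2-action MDP: T[s][a] = (reward, next_state).
# In state 0, action 0 earns 1.0 and stays put, so it is optimal there.
T = {0: {0: (1.0, 0), 1: (0.0, 1)},
     1: {0: (0.5, 0), 1: (0.0, 1)}}
gamma, rng = 0.9, random.Random(0)
bandits = [Exp3(2) for _ in range(2)]    # one bandit per state
Q = [[0.0, 0.0], [0.0, 0.0]]             # tabular action-value estimates
s = 0
for _ in range(20000):
    a = bandits[s].sample(rng)
    r, s2 = T[s][a]
    target = r + gamma * max(Q[s2])      # bootstrapped discounted return
    Q[s][a] += 0.1 * (target - Q[s][a])
    # rescale the target into [0, 1] (max return is 1 / (1 - gamma) = 10)
    bandits[s].update(a, target * (1 - gamma))
    s = s2
```

Under this toy setup, the state-0 bandit gradually concentrates on the optimal action, illustrating how per-state bandit learners can drive a tabular RL loop; the actual regret analysis in the work summarized above is far more delicate.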