Optimal Regret Bounds for Contextual Bandits and Reinforcement Learning Exploration with EXP-based Algorithms
This work proposes EXP4.P, an algorithm that achieves optimal regret bounds for contextual bandits with both bounded and unbounded rewards. It also extends EXP4.P to reinforcement learning, incentivizing exploration by multiple agents when only black-box rewards are available.
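To make the algorithm concrete, below is a minimal sketch of an EXP4.P-style update as commonly stated in the contextual-bandit literature (exponential weights over experts, an importance-weighted reward estimate, and a variance-based confidence bonus). The function name `exp4p` and the interface for `experts` and `reward_fn` are illustrative assumptions, not this paper's API; rewards are assumed bounded in [0, 1] and T large enough that K * p_min < 1.

```python
import math
import random

def exp4p(experts, reward_fn, K, T, delta=0.05, seed=0):
    """Illustrative EXP4.P-style sketch for K arms over T rounds.

    experts: list of callables t -> list of K probabilities (advice vectors).
    reward_fn: callable (t, arm) -> reward in [0, 1] (hypothetical interface).
    Returns the cumulative reward collected.
    """
    rng = random.Random(seed)
    N = len(experts)
    p_min = math.sqrt(math.log(N) / (K * T))   # exploration floor
    w = [1.0] * N                              # one weight per expert
    total = 0.0
    for t in range(T):
        advice = [e(t) for e in experts]       # each advice row sums to 1
        W = sum(w)
        # Mix expert advice by weight, then smooth toward uniform exploration;
        # the probabilities still sum to 1 because each advice row does.
        p = [(1 - K * p_min) * sum(w[i] * advice[i][a] for i in range(N)) / W
             + p_min
             for a in range(K)]
        arm = rng.choices(range(K), weights=p)[0]
        r = reward_fn(t, arm)
        total += r
        for i in range(N):
            # Importance-weighted reward estimate for expert i.
            yhat = advice[i][arm] * r / p[arm]
            # Variance proxy: large when the expert bets on rarely-pulled arms.
            vhat = sum(advice[i][a] / p[a] for a in range(K))
            bonus = vhat * math.sqrt(math.log(N / delta) / (K * T))
            w[i] *= math.exp(p_min / 2 * (yhat + bonus))
    return total
```

The confidence bonus keeps weights on experts whose advice has been under-explored, which is what distinguishes EXP4.P from plain EXP4 and underlies its high-probability regret guarantee.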