Leveraging Offline Data in Multi-armed Bandits: Adaptive Policy for Biased Information
This paper designs an adaptive policy, MIN-UCB, that leverages offline data to improve online learning in stochastic multi-armed bandits, even when the offline data is biased and the difference between the offline and online reward distributions is unknown.
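The summary above does not spell out how MIN-UCB computes its index, so the following is only a minimal sketch of one plausible reading of the name: for each arm, take the smaller of an online-only upper confidence bound and a bound that pools offline and online samples. The function name `min_ucb_index`, the parameter `bias_bound` (a hypothetical user-supplied guess at how far the offline mean may sit from the online mean), and the confidence radius `sqrt(2 log t / n)` are all assumptions made for illustration, not details taken from the paper.

```python
import math


def min_ucb_index(online_mean, n_online, offline_mean, n_offline, bias_bound, t):
    """Hypothetical index: the smaller of an online-only UCB and a pooled UCB.

    bias_bound is an assumed upper bound on |offline mean - online mean|;
    it is an illustrative input, not a quantity defined in the summary.
    """
    log_term = 2.0 * math.log(max(t, 2))

    # Online-only UCB: ignores offline data, infinite before the first online pull.
    if n_online > 0:
        ucb_online = online_mean + math.sqrt(log_term / n_online)
    else:
        ucb_online = math.inf

    n_total = n_online + n_offline
    if n_total == 0:
        return math.inf  # no information about this arm at all

    # Pooled UCB: tighter radius from more samples, plus a correction that
    # covers the worst-case shift introduced by the biased offline samples.
    pooled_mean = (n_online * online_mean + n_offline * offline_mean) / n_total
    bias_correction = (n_offline / n_total) * bias_bound
    ucb_pooled = pooled_mean + math.sqrt(log_term / n_total) + bias_correction

    # Both quantities upper-bound the online mean (if bias_bound is valid),
    # so the minimum is the tighter of the two.
    return min(ucb_online, ucb_pooled)
```

Under these assumptions, a loose `bias_bound` makes the pooled bound large and the index falls back to the online-only bound, while a tight `bias_bound` with many offline samples lets the pooled bound win. That fallback is the sense in which such a rule adapts to how trustworthy the offline data is.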