The paper considers a stochastic multi-armed bandit problem in which the decision maker (DM) has access to an offline dataset before the online learning phase begins. The offline dataset is governed by a probability distribution that may differ from that of the online rewards.
The key insights are:
Without additional information about the difference between the offline and online reward distributions, no non-anticipatory policy allows the DM to outperform the vanilla UCB policy.
To bypass this impossibility result, the authors propose the MIN-UCB policy, which adaptively chooses to use the offline data when they are deemed informative and to ignore them otherwise. MIN-UCB requires an auxiliary input, a valid bias bound, which serves as an upper bound on the difference between the offline and online reward distributions.
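A minimal Python sketch of this min-of-two-indices idea is given below. It assumes Hoeffding-style confidence radii, a pooled offline/online sample mean, and a scalar per-arm bias bound; these are illustrative choices, and the paper's exact index construction and constants may differ.

```python
import math

def ucb_radius(n, t):
    """Hoeffding-style confidence radius for n samples at round t
    (assumes 1-subgaussian rewards; constants are illustrative)."""
    return math.sqrt(2.0 * math.log(max(t, 2)) / n) if n > 0 else float("inf")

def min_ucb_index(mu_on, n_on, mu_off, n_off, bias_bound, t):
    """Index of one arm: the minimum of
      (a) the online-only UCB, and
      (b) an offline-assisted UCB from pooled samples, inflated by the
          bias bound on the offline/online mean discrepancy.
    If the bias bound is valid, both (a) and (b) upper-bound the online
    mean with high probability, so their minimum does too; (b) is the
    tighter one exactly when the offline data are informative."""
    online_ucb = mu_on + ucb_radius(n_on, t)
    n_pool = n_on + n_off
    if n_pool == 0:
        return float("inf")
    mu_pool = (n_on * mu_on + n_off * mu_off) / n_pool
    offline_ucb = mu_pool + ucb_radius(n_pool, t) + bias_bound
    return min(online_ucb, offline_ucb)
```

Taking the argmax of these indices each round yields a UCB-style policy that automatically falls back on the online-only index whenever the bias-inflated offline-assisted bound is looser, matching the adaptive use-or-ignore behavior described above.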
The authors establish both instance-dependent and instance-independent regret bounds for MIN-UCB. They show that MIN-UCB outperforms the vanilla UCB when the offline and online reward distributions are "sufficiently close", and matches the performance of vanilla UCB when they are "far apart".
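For reference, the classical instance-dependent guarantee for vanilla UCB, a standard result rather than a formula quoted from this paper, has the form

```latex
\mathbb{E}\left[\mathrm{Regret}(T)\right] \;\le\; \sum_{i \,:\, \Delta_i > 0} O\!\left(\frac{\log T}{\Delta_i}\right),
```

where \(\Delta_i\) denotes the suboptimality gap of arm \(i\). MIN-UCB's guarantees are never worse than this benchmark and improve on it when the offline data are sufficiently informative.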
The authors also provide matching regret lower bounds, establishing the tightness of their analysis. In the special case when the offline and online reward distributions are identical, MIN-UCB achieves the optimal regret bound.
Numerical experiments corroborate the theoretical findings, demonstrating the robustness of MIN-UCB in adapting to the quality of the offline data.
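An experiment of this flavor can be put together in a few lines. The snippet below reuses `ucb_radius` and `min_ucb_index` from the earlier sketch, and all concrete numbers (arm means, offline sample counts, bias levels, horizon) are invented for illustration only.

```python
import random

def run_min_ucb(means_online, offline_data, bias_bound, T, seed=0):
    """Play T rounds of the MIN-UCB-style sketch against Gaussian arms;
    returns cumulative regret. offline_data[i] is a list of offline
    samples for arm i (possibly drawn from shifted means)."""
    rng = random.Random(seed)
    K = len(means_online)
    best = max(means_online)
    n_on = [0] * K
    sum_on = [0.0] * K
    mu_off = [sum(d) / len(d) if d else 0.0 for d in offline_data]
    n_off = [len(d) for d in offline_data]
    regret = 0.0
    for t in range(1, T + 1):
        idx = [
            min_ucb_index(
                sum_on[i] / n_on[i] if n_on[i] else 0.0, n_on[i],
                mu_off[i], n_off[i], bias_bound, t,
            )
            for i in range(K)
        ]
        a = max(range(K), key=lambda i: idx[i])
        n_on[a] += 1
        sum_on[a] += rng.gauss(means_online[a], 1.0)
        regret += best - means_online[a]
    return regret

# Offline samples drawn from means shifted by delta, with a matching
# bias bound supplied to the policy.
means = [0.5, 0.6, 0.9]
for delta in (0.0, 0.3):
    off = [[m + delta] * 50 for m in means]   # 50 offline samples per arm
    print(delta, run_min_ucb(means, off, bias_bound=delta, T=2000))
```

With `delta = 0` the offline data are unbiased and should reduce regret; with `delta = 0.3` the bias-inflated offline-assisted index is loose, and the policy should behave like vanilla UCB, which is the robustness property the experiments illustrate.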
Source: Wang Chi Che..., arxiv.org, 05-07-2024, https://arxiv.org/pdf/2405.02594.pdf