Core Concepts
This paper studies adversarial combinatorial bandits with switching costs: it derives lower bounds on the minimax regret and proposes algorithms that approximately match these bounds under both bandit feedback and semi-bandit feedback.
Abstract
The paper studies adversarial combinatorial bandits with switching costs, where each switch between arms incurs a cost λ > 0. The authors consider both the bandit feedback setting, where only the total loss of the chosen combinatorial arm is observed, and the semi-bandit feedback setting, where the loss of each base arm in the chosen combinatorial arm is observed.
The key contributions are:
The authors derive lower bounds on the minimax regret under both feedback settings. For bandit feedback the lower bound is Ω((λK)^(1/3)(TI)^(2/3)/log^2 T), and for semi-bandit feedback it is Ω((λKI)^(1/3)T^(2/3)/log^2 T), where K is the number of base arms, I is the number of base arms contained in each combinatorial arm, and T is the time horizon.
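Ignoring constants, the two lower bounds are simple functions of (λ, K, I, T). The sketch below is a constants-free reading of the formulas (illustrative only, not code from the paper) that makes the scaling easy to compare:

```python
import math

def bandit_lower_bound(lam, K, I, T):
    """Order of the bandit-feedback lower bound: (lam*K)^(1/3) (T*I)^(2/3) / log^2 T."""
    return (lam * K) ** (1 / 3) * (T * I) ** (2 / 3) / math.log(T) ** 2

def semi_bandit_lower_bound(lam, K, I, T):
    """Order of the semi-bandit lower bound: (lam*K*I)^(1/3) T^(2/3) / log^2 T."""
    return (lam * K * I) ** (1 / 3) * T ** (2 / 3) / math.log(T) ** 2
```

Dividing the two expressions, the bandit bound exceeds the semi-bandit bound by exactly a factor I^(1/3), reflecting that semi-bandit feedback reveals more information per round.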
To approach these lower bounds, the authors design two algorithms:
For bandit feedback, the BATCHED-EXP2 algorithm with John's exploration achieves a regret upper bound of Õ((λK)^(1/3)T^(2/3)I^(4/3)).
For semi-bandit feedback, the BATCHED-BROAD algorithm achieves a regret upper bound of Õ((λK)^(1/3)(TI)^(2/3) + KI).
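Both algorithms rely on the same batching device: the chosen arm is frozen within each batch, so the switching cost λ is paid at most once per batch, and the batch length is tuned to balance total switching cost against learning regret. The sketch below illustrates this device with a batched EXP3 learner in the single-base-arm case (I = 1); it is a simplified stand-in, not the paper's BATCHED-EXP2 or BATCHED-BROAD:

```python
import math
import random

def batched_exp3(loss_fn, K, T, lam, seed=0):
    """Batched EXP3 sketch for I = 1: freeze the arm within each batch.

    Batch length tau ~ (lam^2 * T / K)^(1/3) balances the switching
    cost (~ lam * T / tau) against the learning regret
    (~ sqrt(tau * T * K)), giving the (lam*K)^(1/3) T^(2/3) rate.
    loss_fn(t, arm) must return a loss in [0, 1].
    """
    rng = random.Random(seed)
    tau = max(1, round((lam ** 2 * T / K) ** (1 / 3)))  # batch length
    n_batches = math.ceil(T / tau)
    eta = math.sqrt(math.log(K) / (n_batches * K)) / tau  # batch losses lie in [0, tau]
    log_w = [0.0] * K
    total_loss, switches, prev_arm, t = 0.0, 0, None, 0
    while t < T:
        # sample an arm from the exponential-weights distribution
        m = max(log_w)
        probs = [math.exp(w - m) for w in log_w]
        z = sum(probs)
        probs = [p / z for p in probs]
        arm = rng.choices(range(K), weights=probs)[0]
        if prev_arm is not None and arm != prev_arm:
            switches += 1
        prev_arm = arm
        # play the same arm for the whole batch: no switching cost inside it
        batch_loss = 0.0
        for _ in range(min(tau, T - t)):
            batch_loss += loss_fn(t, arm)
            t += 1
        total_loss += batch_loss
        # importance-weighted update from the single observed batch loss
        log_w[arm] -= eta * batch_loss / probs[arm]
    return total_loss + lam * switches, switches
```

By construction the number of switches is at most the number of batches, roughly (K T / λ²)^(1/3), regardless of how adversarial the losses are.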
The authors show that the regret of these algorithms exceeds the corresponding lower bound by a factor of at most I^(2/3) under bandit feedback and I^(1/3) under semi-bandit feedback, suggesting that further improvements may be possible.
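The stated gaps follow by dividing each upper bound by its lower bound (log factors and the additive KI term ignored); everything except I cancels. A quick numerical check of this arithmetic (illustrative only, with example parameter values):

```python
def bandit_gap(I, lam=2.0, K=None, T=10 ** 6):
    """Ratio of the bandit upper bound (lam*K)^(1/3) T^(2/3) I^(4/3)
    to the lower bound (lam*K)^(1/3) (T*I)^(2/3); equals I^(2/3)."""
    K = K if K is not None else 3 * I  # K >= 3I, as the lower bound assumes
    upper = (lam * K) ** (1 / 3) * T ** (2 / 3) * I ** (4 / 3)
    lower = (lam * K) ** (1 / 3) * (T * I) ** (2 / 3)
    return upper / lower

def semi_bandit_gap(I, lam=2.0, K=None, T=10 ** 6):
    """Ratio of the semi-bandit upper bound (lam*K)^(1/3) (T*I)^(2/3)
    to the lower bound (lam*K*I)^(1/3) T^(2/3); equals I^(1/3)."""
    K = K if K is not None else 3 * I
    upper = (lam * K) ** (1 / 3) * (T * I) ** (2 / 3)
    lower = (lam * K * I) ** (1 / 3) * T ** (2 / 3)
    return upper / lower
```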
Stats
K ≥ 3I and T ≥ max{λK/I, 8} for the lower bound under bandit feedback.
K ≥ 3I and T ≥ max{λK/I^2, 6} for the lower bound under semi-bandit feedback.