Key concepts
The WSU-UX algorithm, a natural choice for incentive-compatible online learning with bandit feedback, suffers worst-case expected regret of Ω(T^2/3) for every valid choice of its hyperparameters.
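Concretely, writing p_t for the distribution WSU-UX plays over the n experts and ℓ_t ∈ [0,1]^n for the round-t loss vector, the claim can be stated as below (a standard formulation of expected regret against the best fixed expert; the notation is ours, not necessarily the paper's):

```latex
R_T \;=\; \mathbb{E}\!\left[\sum_{t=1}^{T} \langle p_t, \ell_t \rangle\right]
\;-\; \min_{i \in [n]} \sum_{t=1}^{T} \ell_{t,i},
\qquad
\sup_{\ell_1,\dots,\ell_T} R_T \;=\; \Omega\!\left(T^{2/3}\right)
\quad \text{for every valid } (\eta, \gamma).
```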
Summary
The paper analyzes the regret of WSU-UX, an incentive-compatible online learning algorithm for prediction with selfish (reputation-seeking) experts under bandit feedback.
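For intuition, here is a minimal sketch of WSU-UX in Python. It assumes the weighted-score update π_i ← π_i(1 + η(⟨π, ℓ̂⟩ − ℓ̂_i)) applied to importance-weighted loss estimates, with uniform exploration mixed in at rate γ; this follows our reading of the original algorithm, and details (e.g., exactly which distribution appears in the inner product) may differ from the paper:

```python
import numpy as np

def wsu_ux(losses, eta, gamma, rng=None):
    """Sketch of WSU-UX on a (T, n) array of expert losses in [0, 1].

    Assumption: the update is pi_i <- pi_i * (1 + eta * (<pi, lhat> - lhat_i))
    with importance-weighted loss estimates lhat; a "valid" (eta, gamma) here
    means eta * n / gamma <= 1, so every multiplicative factor stays >= 0.
    Returns the algorithm's cumulative loss.
    """
    rng = np.random.default_rng() if rng is None else rng
    T, n = losses.shape
    pi = np.full(n, 1.0 / n)                  # reputation weights (a distribution)
    total_loss = 0.0
    for t in range(T):
        p = (1.0 - gamma) * pi + gamma / n    # mix in uniform exploration
        i = rng.choice(n, p=p)                # bandit feedback: observe one loss
        total_loss += losses[t, i]
        lhat = np.zeros(n)
        lhat[i] = losses[t, i] / p[i]         # unbiased importance-weighted estimate
        pi = pi * (1.0 + eta * (pi @ lhat - lhat))  # weighted-score update
        pi /= pi.sum()                        # guard against floating-point drift
    return total_loss
```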
Key highlights:
- The authors show that for any valid choice of hyperparameters (learning rate η and exploration rate γ), there exists a loss sequence on which the expected regret of WSU-UX is Ω(T^2/3); a toy simulation illustrating the hyperparameter trade-off appears after this list.
- This shows that the O(T^2/3) regret bound proved by the original authors is tight for WSU-UX: no tuning of η and γ can improve it. It also suggests (though does not prove) that learning with reputation-seeking experts under bandit feedback may be strictly harder than the classical bandit problem, where √T-type regret is achievable.
- The proof involves a careful analysis of the probability updates in WSU-UX, leveraging a recent multiplicative form of Azuma's inequality to show that the algorithm cannot concentrate on the best expert quickly enough to achieve regret better than T^2/3.
- The authors also provide a high-level overview of the proof, highlighting the key technical challenges in establishing the lower bound.
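To make the hyperparameter trade-off concrete, the toy run below feeds the wsu_ux sketch from above a simple two-expert Bernoulli loss sequence (a hypothetical instance of ours, not the paper's lower-bound construction) and prints the empirical regret for a few valid (η, γ) pairs:

```python
import numpy as np

rng = np.random.default_rng(0)
T, n = 10_000, 2
# Hypothetical instance: expert 0 is slightly better on average.
losses = rng.binomial(1, [0.45, 0.55], size=(T, n)).astype(float)
best = losses.sum(axis=0).min()               # loss of the best fixed expert

for eta, gamma in [(0.001, 0.01), (0.005, 0.05), (0.02, 0.2)]:
    assert eta * n / gamma <= 1, "invalid (eta, gamma) for this sketch"
    alg_loss = wsu_ux(losses, eta=eta, gamma=gamma, rng=rng)
    print(f"eta={eta}, gamma={gamma}: empirical regret ~ {alg_loss - best:.0f}")
```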