Core Concepts
This paper proposes a new algorithm, FGTS.CDB, for linear contextual dueling bandits, based on the Feel-Good Thompson sampling technique. The algorithm achieves a nearly minimax-optimal regret bound of Õ(d√T), where d is the feature dimensionality and T is the number of rounds.
Abstract
The paper proposes a new algorithm, FGTS.CDB, for the problem of linear contextual dueling bandits. The key aspects are:
The algorithm is based on the Feel-Good Thompson sampling technique, a variant of standard Thompson sampling. The core idea is to add a "Feel-Good" exploration term to the likelihood function, which biases the posterior toward model parameters that predict a high reward for the best arm in previous rounds, thereby promoting optimistic exploration.
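As a rough illustration, a Feel-Good posterior in the standard (non-dueling) contextual bandit setting has roughly the following shape; the notation here (prior p_0, temperature η, Feel-Good weight λ, per-round loss L_s, feature map φ) is schematic and not the paper's exact definition:

```latex
% Schematic Feel-Good posterior (standard contextual-bandit form).
% p_0: prior; eta: likelihood temperature; lambda: Feel-Good weight;
% L_s: per-round loss; phi(x_s, a): feature map. All notation illustrative.
p(\theta \mid \mathcal{D}_t) \;\propto\; p_0(\theta)\,
  \exp\!\Big(-\eta \sum_{s<t} L_s(\theta)
  \;+\; \lambda \sum_{s<t} \max_{a \in \mathcal{A}}
    \langle \theta, \phi(x_s, a) \rangle \Big)
```

The λ-weighted max term is what distinguishes Feel-Good Thompson sampling from the standard posterior: it rewards parameters under which some arm looks very good, rather than only parameters that fit past observations.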
The Feel-Good exploration term in FGTS.CDB is specifically designed for the dueling bandit setting, containing an additional inner product term between the current model parameter and the feature vector of the adversarial arm. This term plays a crucial role in the analysis by eliminating cross terms that arise from the comparison of actions.
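Concretely, one might picture the dueling-bandit Feel-Good term as something like the sketch below, where a_s′ denotes the adversarial (opponent) arm at round s; this is schematic, and the paper's exact weighting and form may differ:

```latex
% Schematic dueling-bandit Feel-Good term: the extra inner product with
% the opponent arm's features phi(x_s, a_s') is what eliminates the
% cross terms arising from action comparisons in the analysis.
\lambda \sum_{s<t} \max_{a \in \mathcal{A}}
  \Big( \langle \theta, \phi(x_s, a) \rangle
      + \langle \theta, \phi(x_s, a_s') \rangle \Big)
```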
The authors prove that FGTS.CDB achieves a nearly minimax-optimal regret bound of Õ(d√T), matching the lower bound for linear contextual dueling bandits. This is the first Thompson sampling-based algorithm to achieve such an optimal regret guarantee in this setting.
The authors also extend the analysis to the case of general nonlinear reward functions and recover the regret bound for several cases of interest, including finite action sets and finite model sets.
Experiments on synthetic data show that FGTS.CDB significantly outperforms existing UCB-based algorithms for contextual dueling bandits, including MaxInP, CoLSTIM, and VACDB.
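For intuition, a minimal sketch of the kind of synthetic linear contextual dueling-bandit environment such experiments typically use is given below. All names, dimensions, and the Bradley-Terry preference model are illustrative assumptions, not the paper's exact experimental setup:

```python
# Minimal sketch of a synthetic linear contextual dueling-bandit environment.
# Parameter choices (d, n_arms, T) and the logistic preference model are
# illustrative assumptions, not the paper's exact configuration.
import numpy as np

rng = np.random.default_rng(0)
d, n_arms, T = 5, 10, 1000
theta_star = rng.normal(size=d)
theta_star /= np.linalg.norm(theta_star)  # unknown true parameter

def duel(phi_a, phi_b):
    """Return True if arm a beats arm b under a Bradley-Terry / logistic model."""
    p_a_wins = 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ theta_star))
    return rng.random() < p_a_wins

regret = 0.0
for t in range(T):
    features = rng.normal(size=(n_arms, d))  # contextual feature vectors
    # Placeholder policy: pick two distinct arms uniformly at random.
    a, b = rng.choice(n_arms, size=2, replace=False)
    outcome = duel(features[a], features[b])
    # (A learning algorithm such as FGTS.CDB would update its model
    #  with `outcome` here; this baseline ignores the feedback.)
    # Dueling-bandit (average) regret relative to the best arm.
    rewards = features @ theta_star
    regret += rewards.max() - 0.5 * (rewards[a] + rewards[b])

print(f"cumulative regret of the random baseline: {regret:.1f}")
```

Plugging a learning algorithm into the placeholder policy and plotting cumulative regret over T rounds is the standard way such comparisons are reported.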
Stats
Aside from the synthetic experiments, the paper reports no explicit numerical statistics; its key quantitative results are the regret bounds derived for the proposed FGTS.CDB algorithm.