Core Concepts
Establishing optimal and adaptive regret bounds for non-stationary dueling bandits under the Borda criterion.
Summary
This work studies non-stationary dueling bandits under the Borda criterion, where preferences between arms may change over time. It introduces a framework of generalized Borda scores that unifies the Condorcet and Borda regret minimization tasks, shows that changes in the winner arm can be tracked adaptively without prior knowledge of the amount of non-stationarity, and presents algorithmic designs achieving optimal dynamic regret bounds.
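For orientation, here is a minimal sketch of the central quantities; the notation (P_t, w_t, i_t, j_t) is ours, not quoted from the paper. At each round t the learner duels a pair of arms and observes a noisy comparison drawn from a time-varying preference matrix P_t. The Borda score of an arm averages its win probabilities over all opponents, while a generalized Borda score weights opponents by an arbitrary distribution w_t, so the Condorcet and Borda objectives become two instances of one regret criterion.

```latex
% Illustrative definitions (our notation, not verbatim from the paper).
% P_t(a, a') is the probability that arm a beats arm a' at round t.
\begin{align*}
  b_t(a)        &= \tfrac{1}{K-1} \textstyle\sum_{a' \neq a} P_t(a, a')
                   && \text{(Borda score)} \\
  b_t^{w}(a)    &= \textstyle\sum_{a'} w_t(a')\, P_t(a, a')
                   && \text{(generalized Borda score with weights } w_t\text{)} \\
  \mathrm{DR}_T &= \textstyle\sum_{t=1}^{T} \Big( \max_a b_t^{w}(a)
                   - \tfrac{1}{2}\big(b_t^{w}(i_t) + b_t^{w}(j_t)\big) \Big)
                   && \text{(dynamic regret of the played pairs } (i_t, j_t)\text{)}
\end{align*}
```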
Structure:
- Introduction to Dueling Bandits (K-armed problem with relative feedback).
- Setup - Non-stationary Dueling Bandits.
- Dynamic Regret Lower Bounds (Borda and Condorcet criteria).
- Dynamic Regret Upper Bounds (Borda and Condorcet formulations).
- A New Unified View of Condorcet and Borda Regret (Generalized Borda scores).
- Algorithmic Design - Base Algorithm and Meta-Algorithm for non-stationary settings (see the estimation sketch after this list).
- Novelties in Condorcet Regret Analysis (Recasting as Generalized Borda Regret).
- Conclusion and Future Questions.
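As a concrete illustration of the base-algorithm ingredient, a standard device in Borda dueling bandits is that dueling an arm against a uniformly random opponent gives an unbiased estimate of its Borda score. The sketch below is hypothetical (the `duel` interface and function name are ours, not the paper's); a restart- or replay-based meta-algorithm would wrap such estimation to cope with non-stationarity, a step this sketch omits.

```python
import numpy as np

def estimate_borda_scores(duel, K, rounds, rng=None):
    """Hypothetical sketch: estimate Borda scores via uniform-random duels.

    `duel(i, j)` is assumed to return 1 if arm i wins a single noisy
    comparison against arm j, and 0 otherwise.  Because the opponent j is
    drawn uniformly from the other K - 1 arms, each observed outcome is an
    unbiased sample of arm i's Borda score
        b(i) = (1 / (K - 1)) * sum_{j != i} P(i beats j).
    """
    rng = rng or np.random.default_rng()
    wins = np.zeros(K)
    plays = np.zeros(K)
    for _ in range(rounds):
        i = int(rng.integers(K))        # arm whose estimate we refine
        j = int(rng.integers(K - 1))    # uniform opponent index, skipping i
        if j >= i:
            j += 1
        wins[i] += duel(i, j)
        plays[i] += 1
    # Arms never queried in this short sketch default to an uninformative 0.5.
    return np.divide(wins, plays, out=np.full(K, 0.5), where=plays > 0)
```

For example, with `duel = lambda i, j: int(np.random.rand() < P[i, j])` for a fixed preference matrix `P`, the returned vector approximates the row-wise Borda scores of `P`; tracking how these scores drift over time is what the non-stationary analysis addresses.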
Statistics
Saha and Gupta (2022) provided an algorithm achieving nearly optimal Condorcet dynamic regret of Õ(√(KLT)), requiring knowledge of the total number of changes L.
Buening and Saha (2023) proposed a notion of significant Condorcet winner switches, achieving an adaptive dynamic regret bound of Õ(K√(L̃T)) under certain conditions, where L̃ counts the significant switches.
Suk and Agarwal (2023) showed that tighter measures of non-stationarity can be learned adaptively outside the SST∩STI (strong stochastic transitivity and stochastic triangle inequality) assumptions in dueling bandits.
Quotes
"Surprisingly, our techniques for non-stationary Borda dueling bandits also yield improved rates within the Condorcet winner setting."
"Our focus in this work is on the more challenging non-stationary dueling bandits where preferences may change over time."
"We introduce a new unified framework, that of generalized Borda scores, for dueling bandits which generalizes both the Condorcet and Borda problems."