Core Concepts
Regret minimization in sequential decision-making, formulated as a saddle-point optimization problem.
Abstract
The article discusses regret minimization in bandits and reinforcement learning, emphasizing the exploration-exploitation trade-off. It introduces the decision-estimation coefficient (DEC) and its variants, such as the average-constrained DEC, and presents the ANYTIME-E2D algorithm, which optimizes the exploration-exploitation trade-off online. Connections to the information ratio, the decoupling coefficient, and PAC-DEC are highlighted. The algorithm is evaluated on simple examples, where it shows improvements for linear bandits with side-observations.
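To make the min-max flavor of the DEC concrete, here is a minimal sketch under stated assumptions. It is not the article's ANYTIME-E2D algorithm and not the average-constrained DEC. It approximates the standard offset form of the coefficient, min over action distributions p of max over models f of E_{a~p}[gap_f(a) - gamma * (f(a) - f_hat(a))^2], for a finite model class with exactly two actions, using a grid search over p. The function name `offset_dec`, the squared-error divergence, the two-action restriction, and the grid size are all illustrative choices.

```python
import numpy as np


def offset_dec(models, f_hat, gamma, grid=1001):
    """Approximate min_p max_f E_{a~p}[gap_f(a) - gamma * (f(a) - f_hat(a))**2].

    Illustrative sketch only: finite model class, two actions, grid search over p.
    models: array of shape (num_models, 2), mean reward of each action per model.
    f_hat:  array of shape (2,), the current reward estimate.
    """
    models = np.asarray(models, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)

    # Suboptimality gap of each action under each model.
    gaps = models.max(axis=1, keepdims=True) - models
    # Squared-error "information" each action reveals about model f vs. f_hat.
    info = (models - f_hat) ** 2
    # Per-model, per-action payoff of the adversary.
    payoff = gaps - gamma * info

    best_value, best_p = np.inf, None
    for q in np.linspace(0.0, 1.0, grid):
        p = np.array([q, 1.0 - q])      # distribution over the two actions
        worst = (payoff @ p).max()      # adversary picks the worst-case model
        if worst < best_value:
            best_value, best_p = worst, p
    return best_value, best_p


if __name__ == "__main__":
    # Two hypothetical models that disagree on which action is optimal.
    models = [[1.0, 0.0], [0.0, 1.0]]
    f_hat = [0.5, 0.5]
    for gamma in (0.0, 1.0, 4.0):
        value, p = offset_dec(models, f_hat, gamma)
        print(f"gamma={gamma}: value={value:.3f}, p={p}")
```

In this toy instance both actions are equally informative, so the minimizing distribution is uniform and a larger gamma only lowers the saddle-point value. For more than two actions the grid search would have to be replaced, for example by a linear program over the simplex.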
Stats
"37th Conference on Neural Information Processing Systems (NeurIPS 2023)"
"15 Mar 2024"
Quotes
"A long line of works characterizes the sample complexity of regret minimization in sequential decision-making by min-max programs."
"The learner’s objective is to collect as much reward as possible in n steps when facing a model f* ∈ M."