Core Concepts
The decision-estimation coefficient (DEC) characterizes the optimal exploration-exploitation trade-off in sequential decision-making.
Summary
The content discusses regret minimization in sequential decision-making through the lens of saddle-point optimization. It introduces the decision-estimation coefficient (DEC) and its variants, such as the average-constrained DEC. The ANYTIME-E2D algorithm is presented, which optimizes the DEC online and exploits structured observations. Connections to the information ratio and the decoupling coefficient are explored, along with empirical results on linear bandits. Computational aspects and upper bounds are discussed, with an emphasis on practical implementation.
Introduction
- Regret minimization is a central objective in bandits and reinforcement learning.
- Balancing the exploration-exploitation trade-off is key in sequential decision-making.
Regret Minimization via Saddle-Point Optimization
- The sample complexity of regret minimization is characterized by min-max (saddle-point) programs.
- The decision-estimation coefficient (DEC) captures the optimal exploration-exploitation trade-off.
- The ANYTIME-E2D algorithm is introduced as a practical implementation.
Related Work
- The literature offers various approaches to regret minimization in bandits and reinforcement learning.
- Saddle-point formulations are used to obtain optimal regret bounds.
Setting
- The decision-making problem is defined over a compact decision space Π and an observation space O.
- Each model is associated with a reward function and an observation distribution.
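To make the objective concrete, a standard regret notion in this setting (writing f∗(π) for the expected reward of decision π ∈ Π under the true model f∗, and π∗ for its maximizer; the paper's exact notation may differ) is:

```latex
\mathrm{Reg}(T) \;=\; \sum_{t=1}^{T} \mathbb{E}\left[ f^{*}(\pi^{*}) - f^{*}(\pi_t) \right],
\qquad \pi^{*} \in \operatorname*{arg\,max}_{\pi \in \Pi} f^{*}(\pi).
```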
Regret Minimization via Saddle-Point Optimization
- The learner aims to minimize the reward gap between its decisions and the optimal decision under the true model f∗.
- An information function quantifies the statistical evidence against alternative models g ≠ f∗.
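A common instantiation of such an information function in the DEC literature (the paper's exact choice may differ) is the squared Hellinger distance between the observation distributions induced by the two models at decision π:

```latex
I(g; \pi) \;=\; D_{\mathrm{H}}^{2}\!\big( M_{f^{*}}(\pi),\, M_{g}(\pi) \big),
\qquad
D_{\mathrm{H}}^{2}(P, Q) \;=\; \frac{1}{2} \int \big( \sqrt{\mathrm{d}P} - \sqrt{\mathrm{d}Q} \big)^{2}.
```

Decisions under which g and f∗ induce similar observations carry little evidence, so the learner must occasionally play decisions that separate them.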
The Decision-Estimation Coefficient
- The DEC is introduced as a min-max game between the learner and the environment.
- The constrained DEC is parametrized by a confidence radius ϵ, enabling online optimization.
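Schematically, the offset DEC takes the following min-max form, common in the literature (notation varies across papers; here M̂(π) is the observation distribution of the current estimate f̂, and γ > 0 trades off regret against information):

```latex
\mathrm{dec}_{\gamma}(\mathcal{M}, \hat{f})
\;=\;
\min_{p \in \Delta(\Pi)} \max_{g \in \mathcal{M}}
\mathbb{E}_{\pi \sim p}\Big[
g(\pi_{g}) - g(\pi)
\;-\; \gamma \, D_{\mathrm{H}}^{2}\big( \hat{M}(\pi),\, M_{g}(\pi) \big)
\Big],
```

where π_g maximizes the reward under model g. Reparametrizing the information penalty via a confidence radius ϵ yields the constrained variants.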
Anytime Estimation-To-Decisions (Anytime-E2D)
- The E2D algorithm leverages the average-constrained DEC for decision-making.
- Regret bounds are derived in terms of the estimation error and the worst-case DEC.
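Guarantees of this kind typically have the following schematic shape (a hedged sketch; Est(T) denotes the cumulative estimation error of the online estimation oracle, and constants and log factors are omitted):

```latex
\mathrm{Reg}(T)
\;\lesssim\;
\sup_{\hat{f}} \mathrm{dec}_{\gamma}(\mathcal{M}, \hat{f}) \cdot T
\;+\;
\gamma \cdot \mathrm{Est}(T),
```

so that tuning γ balances the worst-case DEC against the estimation error.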
Certifying Upper Bounds
- The information ratio and the decoupling coefficient are used to upper-bound the decision-estimation coefficient.
Application to Linear Feedback Models
- Improved regret bounds are demonstrated for linear bandits with side-observations.
- An incremental scheme is proposed for iteratively computing the sampling distribution.
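The incremental scheme itself is not detailed here. As a hedged illustration of what iteratively computing a sampling distribution over a finite arm set can look like, the sketch below runs Frank-Wolfe on the classical D-optimal design objective log det(Σᵢ pᵢ xᵢxᵢᵀ), a standard exploration design for linear models; the function name and parameters are illustrative and not taken from the paper.

```python
import numpy as np

def frank_wolfe_design(X, n_iters=500):
    """Iteratively compute a sampling distribution p over the rows of X
    (arm feature vectors) via Frank-Wolfe on the D-optimal design
    objective log det(sum_i p_i x_i x_i^T)."""
    K, d = X.shape
    p = np.full(K, 1.0 / K)            # start from the uniform distribution
    for t in range(n_iters):
        A = X.T @ (p[:, None] * X)     # information matrix sum_i p_i x_i x_i^T
        # leverage scores x_i^T A^{-1} x_i = gradient of log det w.r.t. p_i
        lev = np.einsum("ij,jk,ik->i", X, np.linalg.inv(A), X)
        i_star = int(np.argmax(lev))   # vertex maximizing the linearized gain
        eta = 2.0 / (t + 3.0)          # standard Frank-Wolfe step size
        p *= (1.0 - eta)               # shrink current iterate ...
        p[i_star] += eta               # ... and move mass toward the vertex
    return p

rng = np.random.default_rng(0)
X = rng.standard_normal((20, 3))       # 20 hypothetical arms in R^3
p = frank_wolfe_design(X)
```

By the Kiefer-Wolfowitz theorem, at the optimum the maximal leverage maxᵢ xᵢᵀA⁻¹xᵢ equals the dimension d, which gives a simple convergence check for such a scheme.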
Conclusion
- The ANYTIME-E2D algorithm enhances regret minimization by exploiting structured observations.
- Implementation details are provided for finite and linear model classes.
Statistics
"By reparametrizing the offset DEC with the confidence radius..."
"The learner’s objective is to collect as much reward as possible..."
"The literature studies regret minimization for various objectives..."
Quotes
"In other words, a learner will inevitably face the exploration-exploitation trade-off where it must balance collecting rewards and collecting information."