Core Concepts
The authors introduce the idea of incorporating an epsilon-greedy policy into Thompson sampling (TS) to enhance its exploitation capability in Bayesian optimization.
Abstract
In this study, the authors explore the integration of an epsilon-greedy policy with Thompson sampling to balance exploration and exploitation in Bayesian optimization. The work aims to improve the performance of Thompson sampling by switching at random, with probability governed by ε, between an exploration-oriented strategy and an exploitation-oriented strategy, so that costly objective functions can be optimized efficiently. Empirical evaluations demonstrate that epsilon-greedy Thompson sampling, with an appropriately chosen ε, outperforms traditional methods and competes effectively with other approaches.
Thompson sampling (TS) is a stochastic policy for addressing the exploitation-exploration dilemma in multi-armed bandit problems. When applied to Bayesian optimization (BO), TS selects new query points by randomly sampling functions from the posterior over the unknown objective and optimizing them. The study considers two extremes of TS for BO: generic TS, which favors exploration, and sample-average TS, which favors exploitation. By incorporating the epsilon-greedy policy, which randomly switches between these two extremes based on a small value of epsilon (ε), the study aims to improve the exploitation strategy of TS.
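To make the switching rule concrete, below is a minimal sketch of ε-greedy TS on a discretized candidate set, assuming a scikit-learn GP surrogate; the helper names (objective, epsilon_greedy_ts), the toy objective, and the hyperparameters (eps, n_sample_paths) are illustrative assumptions, not details from the paper. With probability ε the rule minimizes a single posterior sample path (generic TS, exploration); otherwise it minimizes the average of several sample paths (sample-average TS, exploitation).

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)

def objective(x):
    # Toy 1-D objective standing in for an expensive black-box function.
    return np.sin(3.0 * x) + 0.1 * x**2

def epsilon_greedy_ts(gp, candidates, eps=0.1, n_sample_paths=16):
    """Pick the next query point by switching between two TS extremes.

    With probability eps: generic TS -- minimize a single random sample
    path drawn from the GP posterior (exploration).
    Otherwise: sample-average TS -- minimize the average of several
    sample paths (exploitation).
    """
    if rng.random() < eps:
        path = gp.sample_y(candidates, n_samples=1,
                           random_state=rng.integers(1 << 31))
        scores = path.ravel()
    else:
        paths = gp.sample_y(candidates, n_samples=n_sample_paths,
                            random_state=rng.integers(1 << 31))
        scores = paths.mean(axis=1)
    return candidates[np.argmin(scores)]

# Initial design and a short BO loop on a discretized candidate set.
X = rng.uniform(-2.0, 2.0, size=(5, 1))
y = objective(X).ravel()
candidates = np.linspace(-2.0, 2.0, 500).reshape(-1, 1)

for _ in range(20):
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.5),
                                  normalize_y=True).fit(X, y)
    x_next = epsilon_greedy_ts(gp, candidates, eps=0.1)
    X = np.vstack([X, x_next])
    y = np.append(y, objective(x_next).item())

print("Best observed value:", y.min())
```

As the number of sample paths grows, their average approaches the GP posterior mean, which is why minimizing the averaged path behaves as a purely exploitative strategy.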
The research highlights that a proper choice of ε can significantly affect the performance of epsilon-greedy TS. Varying Ns also shows how the number of sample paths influences the optimization results. A computational cost analysis indicates that, for suitable ε values and a sufficient number of sample paths, the method's efficiency is comparable to that of traditional approaches.
Overall, this study provides valuable insights into enhancing Bayesian optimization techniques by integrating reinforcement learning strategies like epsilon-greedy policies with existing methodologies.
Stats
A dataset consists of several observations of the input variables and the corresponding objective function values.
A GP posterior built from this dataset often serves as a probabilistic model representing beliefs about the objective function.
Several notable acquisition functions have been developed to balance exploitation and exploration.
In each iteration, TS selects an arm from a finite set of arms, each associated with a stochastic reward (a minimal bandit sketch follows this list).
The global minimum location is fully determined by the objective function when using generic TS for BO.
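For the multi-armed bandit setting mentioned above, here is a minimal sketch of TS with a finite set of arms and stochastic rewards; the Beta-Bernoulli reward model, the arm success probabilities, and the horizon are illustrative assumptions rather than details from the study.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative Bernoulli bandit: true success probabilities, unknown to the agent.
true_probs = np.array([0.20, 0.45, 0.60])
n_arms = len(true_probs)

# Beta(1, 1) priors over each arm's reward probability.
successes = np.ones(n_arms)
failures = np.ones(n_arms)

for t in range(1000):
    # Thompson sampling: draw one sample per arm from its posterior
    # and pull the arm with the largest sampled value.
    sampled = rng.beta(successes, failures)
    arm = int(np.argmax(sampled))
    reward = rng.random() < true_probs[arm]
    successes[arm] += reward
    failures[arm] += 1 - reward

print("Posterior means:", successes / (successes + failures))
```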
Quotes
"The goal is to craft a sequence of arms that maximizes cumulative reward under assumption that rewards are independent." - Author
"Several works have introduced ε-greedy policy to BO and multi-armed bandit problems." - Author