Improved Theoretical Guarantees for Thompson Sampling in Stochastic Bandits


Core Concepts
We derive a new problem-dependent regret bound for Thompson Sampling with Gaussian priors that significantly improves the existing bound. Additionally, we propose two parameterized Thompson Sampling-based algorithms, TS-MA-α and TS-TD-α, that achieve a favorable trade-off between utility (regret) and computation (number of drawn posterior samples).
Abstract
The paper studies Thompson Sampling-based algorithms for stochastic multi-armed bandits with bounded rewards. Key highlights:

The existing problem-dependent regret bound for Thompson Sampling with Gaussian priors is vacuous when the learning horizon T is small. The paper derives a new bound that tightens the coefficient of the leading term to 1270.

The paper shows that Thompson Sampling follows the principle of optimism in the face of uncertainty, similar to UCB-based algorithms.

Motivated by large-scale real-world applications, the paper proposes two parameterized Thompson Sampling-based algorithms, TS-MA-α and TS-TD-α, that achieve a favorable trade-off between utility (regret) and computation (number of drawn posterior samples).

TS-MA-α draws a batch of posterior samples at once rather than drawing independent samples in each round. It achieves an O(K ln^(1+α)(T)/Δ) regret bound while drawing fewer than KT samples.

TS-TD-α switches adaptively between Thompson Sampling and TS-MA-α. It draws posterior samples more frequently for optimal arms and avoids drawing samples for sub-optimal arms, leading to significant computational savings.
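As a point of reference for the sample counts mentioned above, here is a minimal sketch of plain Thompson Sampling with Gaussian posteriors. It is not the paper's TS-MA-α or TS-TD-α, whose details this summary does not give. The posterior form N(empirical mean, 1/n), the Bernoulli reward model, and the function and parameter names are all assumptions made for illustration.

```python
import numpy as np

def gaussian_thompson_sampling(means, horizon, seed=0):
    """Plain Thompson Sampling with Gaussian posteriors (illustrative sketch).

    Hypothetical helper: `means`, `horizon`, and `seed` are illustrative names.
    Assumes horizon >= number of arms.
    Returns (pull counts per arm, total posterior samples drawn).
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    num_arms = len(means)
    pulls = np.zeros(num_arms, dtype=int)
    reward_sums = np.zeros(num_arms)
    samples_drawn = 0

    for t in range(horizon):
        if t < num_arms:
            arm = t  # initialization: pull every arm once
        else:
            empirical_means = reward_sums / pulls
            # Assumed posterior: one sample per arm from N(mean_i, 1 / n_i).
            theta = rng.normal(empirical_means, 1.0 / np.sqrt(pulls))
            samples_drawn += num_arms
            arm = int(np.argmax(theta))
        # Assumed reward model: Bernoulli, so rewards are bounded in [0, 1].
        reward = float(rng.random() < means[arm])
        pulls[arm] += 1
        reward_sums[arm] += reward

    return pulls, samples_drawn

pulls, samples_drawn = gaussian_thompson_sampling([0.2, 0.5, 0.8], horizon=2000)
print(pulls, samples_drawn)
```

Because this baseline draws one sample per arm in every round after initialization, its sample count grows as roughly KT, which is the cost that the two proposed algorithms aim to reduce.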
Stats
The paper presents the following key statistics:

The existing problem-dependent regret bound for Thompson Sampling with Gaussian priors has a coefficient of at least 288e^64 ≈ 1.8 × 10^30 for the leading term.

The new problem-dependent regret bound derived in the paper has a coefficient of 1270 for the leading term.
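The size of the gap between the two coefficients can be checked with a few lines of arithmetic. The variable names below are illustrative, and the comparison ignores every term other than the leading coefficient.

```python
import math

# Leading-term coefficient of the existing bound, as quoted: 288 * e^64.
old_coefficient = 288 * math.exp(64)
# Leading-term coefficient of the new bound, as quoted.
new_coefficient = 1270

improvement = old_coefficient / new_coefficient

print(f"existing coefficient: {old_coefficient:.2e}")
print(f"ratio existing / new: {improvement:.2e}")
```

The ratio comes out on the order of 10^27, which is why the older bound only becomes informative for very large horizons T.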
Quotes
"The key advantage of TS-MA-α is that the total number of drawn samples does not depend on the number of arms K. Since the total amount of data-dependent samples does not depend on K, TS-MA-α is extremely efficient when the number of arms is very large."

"The key advantage of TS-TD-α is that it is extremely efficient when the learning problem has many sub-optimal arms, as it stops drawing data-dependent samples for the sub-optimal arms."

Deeper Inquiries

How can the proposed algorithms be extended to achieve anytime guarantees and optimal worst-case regret bounds?

One way to pursue anytime guarantees and optimal worst-case regret bounds is to adjust the exploration-exploitation trade-off adaptively as learning progresses. For example, a mechanism could gradually increase exploration when the algorithm is performing poorly and decrease it when performance is satisfactory, so that the exploration parameter adapts to the problem instead of being fixed in advance. Such an algorithm could aim for good performance in both worst-case and average-case scenarios. Incorporating the principle of optimism in the face of uncertainty (OFU) may also help with anytime guarantees, since it keeps the algorithm exploring uncertain arms rather than settling prematurely on a sub-optimal one.

Can the ideas behind ϵ-TS be generalized to other reward distributions beyond the exponential family?

The ideas behind ϵ-TS could plausibly be generalized beyond the exponential family by adapting the algorithm to the reward distribution at hand. One approach is a flexible framework that selects priors and likelihood functions to match the characteristics of the rewards. By relying only on general properties such as boundedness or sub-Gaussianity, the algorithm might be extended to a wider range of distributions while retaining its performance guarantees. Non-parametric or distribution-free approaches are another option, since they are not tied to a specific family of reward distributions.

Is it possible for Thompson Sampling with Gaussian priors to achieve asymptotic optimality for bounded rewards, or is there a fundamental limitation?

Thompson Sampling with Gaussian priors is unlikely to achieve asymptotic optimality for bounded rewards because of a mismatch between the model and the rewards. Gaussian distributions have unbounded support, so they may not capture the characteristics of bounded rewards effectively. As a result, the posterior distributions may not concentrate at the best possible rate, leading to sub-optimal performance in the long run. Priors that reflect the bounded nature of the rewards, such as Beta priors or truncated Gaussian priors, may be more suitable and could potentially achieve asymptotic optimality for bounded rewards.
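To make the Beta-prior alternative concrete, the following is a minimal sketch of Thompson Sampling with Beta posteriors on rewards bounded in [0, 1]. It uses the well-known binarization step (each reward r is converted into a Bernoulli trial with success probability r) and is not a method taken from the paper summarized here. The uniform reward noise and all function and parameter names are assumptions for illustration.

```python
import numpy as np

def beta_thompson_sampling(means, horizon, seed=0):
    """Thompson Sampling with Beta(1, 1) priors on rewards in [0, 1] (sketch).

    Hypothetical helper: `means`, `horizon`, and `seed` are illustrative names.
    Returns the number of pulls per arm.
    """
    rng = np.random.default_rng(seed)
    means = np.asarray(means, dtype=float)
    num_arms = len(means)
    successes = np.zeros(num_arms)
    failures = np.zeros(num_arms)
    pulls = np.zeros(num_arms, dtype=int)

    for _ in range(horizon):
        # One sample per arm from the Beta posterior; its support is [0, 1].
        theta = rng.beta(successes + 1.0, failures + 1.0)
        arm = int(np.argmax(theta))
        # Assumed reward model: mean plus uniform noise, clipped to [0, 1].
        reward = float(np.clip(means[arm] + rng.uniform(-0.1, 0.1), 0.0, 1.0))
        # Binarization: treat the bounded reward as a Bernoulli success rate.
        if rng.random() < reward:
            successes[arm] += 1
        else:
            failures[arm] += 1
        pulls[arm] += 1

    return pulls

print(beta_thompson_sampling([0.2, 0.5, 0.8], horizon=2000))
```

Unlike the Gaussian case, every posterior sample here stays inside the reward range, which is the property the answer above points to.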