The paper develops a framework for studying settings in which exploration and exploitation are disentangled. The agent faces a recurring choice between two projects of uncertain quality and can learn about the projects only through exploration, not through exploitation.
The key findings are:
When exploration and exploitation are disentangled, the agent asymptotically exploits the best project. This contrasts with the classical multi-armed bandit setting, where the agent's exploitation need not converge to the ex-post optimal project.
In the case of one safe project, the optimal exploitation strategy sets a threshold on the posterior probability that the risky project is good: the agent exploits the risky project only when the belief exceeds the threshold. The threshold depends only on the maximum of the arrival rates of good and bad news.
In the case of two risky projects with fully disentangled exploration and exploitation, the optimal exploration strategy cannot be characterized by an index policy à la Gittins. The optimal policy instead exhibits strong persistence: absent the arrival of news, the agent switches the explored project at most once.
The payoff gains from disentanglement are most pronounced for intermediate parameter values, namely intermediate discount rates, news arrival rates, and initial beliefs about the projects' viability.
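The threshold rule for the one-safe-project case can be sketched numerically. The snippet below is a minimal illustration, not the paper's model: it assumes a standard good-news Poisson learning setup (good news arrives at rate `lam` only if the risky project is good, so the belief drifts down absent news), and the parameter values and threshold are hypothetical.

```python
# Hedged sketch of threshold exploitation under good-news Poisson learning.
# Assumptions (not from the paper): good news arrives at rate lam iff the
# risky project is good; p0, lam, and the threshold value are illustrative.
import math

def posterior_no_news(p0: float, lam: float, t: float) -> float:
    """Posterior belief that the risky project is good after t units of
    time with no news, via Bayes' rule on the no-arrival event."""
    num = p0 * math.exp(-lam * t)
    return num / (num + (1.0 - p0))

def exploit_risky(belief: float, threshold: float) -> bool:
    """Threshold exploitation rule: pick the risky project iff the
    belief weakly exceeds the threshold."""
    return belief >= threshold

if __name__ == "__main__":
    p0, lam, threshold = 0.6, 1.0, 0.5  # hypothetical values
    for t in (0.0, 0.5, 1.0, 2.0):
        p = posterior_no_news(p0, lam, t)
        print(f"t={t:.1f}  belief={p:.3f}  exploit risky: {exploit_risky(p, threshold)}")
```

Absent news the belief decays monotonically, so under this rule the agent exploits the risky project for an initial stretch and then switches permanently to the safe project unless news arrives, matching the single-switch flavor of the findings above.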
Key insights distilled from: Alessandro L..., arxiv.org, 05-01-2024, https://arxiv.org/pdf/2404.19116.pdf