The paper studies multi-stage systems in which each job must pass through multiple stages, each managed by a different agent. Each agent controls only its own actions and observes only the final outcome of the job, with no knowledge of, or control over, the actions taken by agents in subsequent stages.
The key highlights and insights are:
In addition to the exploration-exploitation dilemma of traditional multi-armed bandit problems, multi-stage systems introduce a third component, education: an agent must sometimes choose actions that facilitate the learning of agents in the next stage.
The paper proposes a distributed online learning algorithm, ε-EXP3, that explicitly addresses this exploration-exploitation-education trilemma. ε-EXP3 alternates between two modes: a uniform action-selection mode for education and an EXP3 mode for exploration-exploitation.
The paper proves that ε-EXP3 achieves sublinear regret of O(T^(L/(L+1))), where T is the time horizon and L is the depth of the system.
Simulation results show that ε-EXP3 significantly outperforms existing no-regret algorithms designed for traditional multi-armed bandit problems when applied to multi-stage systems.
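The two-mode structure described above can be sketched as follows. This is a minimal illustration, not the paper's exact pseudocode: the parameter names (epsilon for the education probability, eta for the EXP3 learning rate) are assumptions, and the real algorithm may schedule the modes and tune the rates differently. With probability epsilon the agent plays a uniformly random action, so that downstream agents get to observe outcomes under all of its actions (education); otherwise it samples from EXP3's exponential-weights distribution (exploration-exploitation).

```python
import math
import random

def make_eps_exp3(n_arms, epsilon, eta):
    """Sketch of one agent's two-mode action selection (hypothetical
    parameter names; not the paper's exact pseudocode)."""
    weights = [1.0] * n_arms

    def probs():
        total = sum(weights)
        return [w / total for w in weights]

    def select():
        # Education mode: uniform selection with probability epsilon.
        if random.random() < epsilon:
            return random.randrange(n_arms)
        # Exploration-exploitation mode: sample from EXP3's distribution.
        r, acc = random.random(), 0.0
        for arm, p in enumerate(probs()):
            acc += p
            if r < acc:
                return arm
        return n_arms - 1

    def update(action, reward):
        # Importance-weighted reward estimate; the sampling probability
        # mixes the EXP3 distribution with the uniform (education) mode.
        p_mix = (1 - epsilon) * probs()[action] + epsilon / n_arms
        weights[action] *= math.exp(eta * reward / p_mix)

    return select, update
```

In this sketch, the uniform mode keeps every action's sampling probability bounded below by epsilon / n_arms, which is what lets downstream agents keep learning about all upstream actions.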
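As a rough numerical illustration (not from the paper) of why the O(T^(L/(L+1))) bound is sublinear: the average per-round regret T^(L/(L+1)) / T = T^(-1/(L+1)) vanishes as T grows, though more slowly for deeper systems (larger L).

```python
def avg_regret_rate(T, L):
    """Per-round regret implied by an O(T^(L/(L+1))) total-regret bound."""
    return T ** (L / (L + 1)) / T  # equals T ** (-1 / (L + 1))

# Per-round regret shrinks with T for every depth L, but a deeper
# system (larger L) converges more slowly at the same horizon T.
for L in (1, 2, 3):
    print(L, [avg_regret_rate(T, L) for T in (10**2, 10**4, 10**6)])
```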
Key insights distilled from https://arxiv.org/pdf/2404.04509.pdf
by I-Hong Hou at arxiv.org, 04-09-2024