This work studies multi-agent multi-armed bandit (MA-MAB) learning in the presence of heterogeneous action erasure channels. The key insights are:
In an MA-MAB setting, communication between the central learner and distributed agents can be hindered by action erasures caused by channel delays or noise: the reward the learner observes may then correspond to a different action than the one it requested. This misguided feedback degrades learning performance.
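A minimal simulation sketch of such a channel is below; the fallback rule that an agent replays its last successfully received action upon erasure is our assumption for illustration, not a detail stated in this summary.

```python
import random

def erased_play(requested_arm, last_arm, erasure_prob):
    """Simulate one round over an action erasure channel.

    With probability erasure_prob the learner's request is lost and the
    agent replays its last received arm (our illustrative assumption);
    crucially, the learner cannot observe which of the two cases occurred.
    """
    if random.random() < erasure_prob:
        return last_arm       # request erased: a stale action is played
    return requested_arm      # request delivered: the new action is played
```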
The authors introduce BatchSP2, an algorithm that addresses this challenge by combining successive arm elimination with a carefully designed repetition and scheduling protocol.
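The summary does not detail BatchSP2's internals, so the following is only a sketch of the generic successive arm elimination backbone it builds on; the confidence radius and batch structure here are standard textbook choices, and the erasure-aware repetition and scheduling layer is omitted.

```python
import math

def successive_elimination(pull, n_arms, horizon, delta=0.05):
    """Generic successive arm elimination (the backbone BatchSP2 builds on).

    pull(arm) returns a reward in [0, 1]. Arms whose upper confidence
    bound falls below the best lower confidence bound are eliminated.
    """
    active = set(range(n_arms))
    means = [0.0] * n_arms
    counts = [0] * n_arms
    t = 0
    while len(active) > 1 and t < horizon:
        for arm in list(active):          # one batch: pull every surviving arm once
            r = pull(arm)
            counts[arm] += 1
            means[arm] += (r - means[arm]) / counts[arm]
            t += 1

        def rad(a):                       # standard anytime confidence radius
            return math.sqrt(math.log(4 * n_arms * counts[a] ** 2 / delta)
                             / (2 * counts[a]))

        best_lcb = max(means[a] - rad(a) for a in active)
        active = {a for a in active if means[a] + rad(a) >= best_lcb}
    return max(active, key=lambda a: means[a])
```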
BatchSP2 achieves sub-linear regret guarantees, in contrast to existing bandit algorithms, which suffer linear regret under action erasures.
The algorithm repeats each action request enough times that delivery succeeds with high probability, and it schedules the action pulls across the heterogeneous channels to minimize the overall learning time.
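The required repetition count follows from a simple calculation, assuming each repeat of a request is erased independently with probability ε: all r repeats are lost with probability ε^r, so choosing r with ε^r ≤ δ, i.e. r ≥ log(1/δ)/log(1/ε), guarantees delivery with probability at least 1 − δ. A hypothetical helper:

```python
import math

def repetitions_needed(erasure_prob, delta):
    """Smallest r with erasure_prob**r <= delta, i.e. the request gets
    through at least once with probability >= 1 - delta, assuming
    independent erasures across repeats."""
    if erasure_prob <= 0.0:
        return 1                      # perfect channel: one send suffices
    return math.ceil(math.log(delta) / math.log(erasure_prob))
```

For example, with ε = 0.5 and δ = 0.01 this gives r = 7, since 0.5^7 ≈ 0.008 ≤ 0.01. Per the summary, the scheduling layer then distributes pulls across channels with different ε so that the noisiest channel does not bottleneck the overall learning time.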
The regret analysis shows that BatchSP2 recovers existing optimal regret bounds as special cases and yields instance-dependent bounds that adapt to the suboptimality gaps and the erasure probabilities.
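As a point of reference (not the paper's exact statement), the classical erasure-free instance-dependent benchmark that such guarantees reduce to when every erasure probability vanishes is

\[
R(T) = O\!\left(\sum_{i \,:\, \Delta_i > 0} \frac{\log T}{\Delta_i}\right),
\]

where \(\Delta_i\) denotes the suboptimality gap of arm \(i\); per the summary, the erasure-aware bounds additionally depend on the channel erasure probabilities.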
Numerical experiments demonstrate the superior performance of BatchSP2 over baseline approaches that are oblivious to action erasures.