The paper studies decentralized online learning in general-sum Stackelberg games, where both players act strategically and learn without central coordination. It considers two settings, distinguished by the information available to the follower:
Limited information setting: the follower observes only its own reward and cannot manipulate the leader. The paper shows that the follower's best strategy is to myopically best respond to the leader's action in each round, and derives last-iterate convergence results when both players use no-regret learning algorithms.
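The limited-information follower above can be sketched as a bandit learner: it sees the leader's action, maintains UCB estimates of its own (unknown) rewards per action pair, and optimistically best-responds. This is an illustrative sketch, not the paper's exact algorithm; the class name, confidence-bonus form, and `delta` parameter are assumptions.

```python
import math
import random

class MyopicUCBFollower:
    """Follower that myopically best-responds to the observed leader action,
    estimating its own rewards from bandit feedback via UCB.
    Illustrative sketch of the limited-information setting."""

    def __init__(self, n_leader_actions, n_follower_actions, delta=0.1):
        self.nA, self.nB = n_leader_actions, n_follower_actions
        self.delta = delta  # confidence parameter (assumed form of the bonus)
        self.counts = [[0] * n_follower_actions for _ in range(n_leader_actions)]
        self.means = [[0.0] * n_follower_actions for _ in range(n_leader_actions)]
        self.t = 0  # global round counter

    def act(self, leader_action):
        """Optimistic best response to the leader's current action."""
        self.t += 1

        def ucb(b):
            n = self.counts[leader_action][b]
            if n == 0:
                return float("inf")  # force exploration of untried responses
            bonus = math.sqrt(2 * math.log(self.t / self.delta) / n)
            return self.means[leader_action][b] + bonus

        return max(range(self.nB), key=ucb)

    def update(self, leader_action, follower_action, reward):
        """Incremental mean update for the played action pair."""
        n = self.counts[leader_action][follower_action] + 1
        self.counts[leader_action][follower_action] = n
        m = self.means[leader_action][follower_action]
        self.means[leader_action][follower_action] = m + (reward - m) / n
```

After enough rounds against any sequence of leader actions, the follower's empirical estimates identify the true best response to each leader action, which is what makes myopic best responding viable without side information.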
Side information setting: the follower has extra information about the leader's reward structure. The paper designs a manipulation strategy, FBM, for the omniscient follower and shows that it yields an intrinsic advantage over the best-response strategy. It then extends this to noisy side information, where the follower learns the leader's rewards from bandit feedback, and proposes FMUCB, a UCB-based variant of FBM, deriving its sample complexity and last-iterate convergence guarantees.
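The manipulation idea for an omniscient follower can be illustrated in a one-shot matrix-game sketch: the follower commits to a response policy, a no-regret leader effectively best-responds to that policy, and the follower chooses the policy whose induced outcome it likes best. This enumeration is an assumption-laden toy, not the paper's FBM algorithm; `u_L` and `u_F` are hypothetical utility matrices indexed as `u[leader_action][follower_action]`.

```python
from itertools import product

def follower_best_manipulation(u_L, u_F):
    """Enumerate follower response policies g (one response per leader action),
    assume the leader converges to its best response argmax_a u_L[a][g[a]],
    and return the policy maximizing the follower's induced utility.
    Illustrative sketch only."""
    nA = len(u_L)     # number of leader actions
    nB = len(u_L[0])  # number of follower actions
    best_val, best_policy = float("-inf"), None
    for g in product(range(nB), repeat=nA):
        # Leader's best response to the committed policy g
        a_star = max(range(nA), key=lambda a: u_L[a][g[a]])
        val = u_F[a_star][g[a_star]]
        if val > best_val:
            best_val, best_policy = val, g
    return best_policy, best_val
```

In the toy instance `u_L = [[3, 0], [0, 2]]`, `u_F = [[1, 0], [2, 3]]`, truthful best responding gives the follower value 1 (the leader picks its Stackelberg action), while committing to the non-best-response policy `(1, 1)` steers the leader to its other action and yields the follower value 3, illustrating the advantage that manipulation can secure.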
The key insights are: (1) in the limited information setting, no-regret learning by both players converges to the Stackelberg equilibrium; (2) in the side information setting, the follower's manipulation strategy can induce an equilibrium more favorable to the follower than the Stackelberg equilibrium.