Core Concepts
Equivariant network architectures can effectively leverage environmental symmetry to improve zero-shot coordination between independently trained agents in decentralized partially observable Markov decision processes.
Abstract
The paper presents a novel equivariant network architecture, called the Equivariant Coordinator (EQC), for use in decentralized partially observable Markov decision processes (Dec-POMDPs) to improve zero-shot coordination between independently trained agents.
Key highlights:
- EQC mathematically guarantees symmetry-equivariance of multi-agent policies (illustrated in the sketch after this list), and can be applied solely at test time as a coordination-improvement operator.
- EQC outperforms prior symmetry-aware baselines on the AI benchmark Hanabi.
- EQC can be used to improve the coordination ability of a variety of pre-trained policies, including the state-of-the-art for zero-shot coordination on Hanabi.
- The authors provide theoretical guarantees for the properties of EQC and demonstrate its empirical efficacy through extensive experiments.
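To make the symmetry-equivariance property concrete, here is a minimal, self-contained sketch (not the paper's code) of what equivariance means under a permutation symmetry, such as relabelling colours or action slots in Hanabi: permuting the observation permutes the action distribution in the same way. The toy `policy` below is just an elementwise softmax head, chosen purely because it satisfies the property exactly.

```python
import numpy as np

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

# Toy policy head: per-action scores -> action distribution. An elementwise map
# followed by a global normalisation is permutation-equivariant by construction.
def policy(obs):
    return softmax(obs)

def apply_perm(perm, x):
    """Act on a vector by relabelling its indices with the permutation `perm`."""
    return x[perm]

rng = np.random.default_rng(0)
obs = rng.normal(size=5)            # toy per-action observation scores
g = rng.permutation(5)              # one group element: a relabelling of actions

lhs = policy(apply_perm(g, obs))    # pi(g . o)
rhs = apply_perm(g, policy(obs))    # g . pi(o)
assert np.allclose(lhs, rhs)        # equivariance: symmetric input change => matching output change
```

The point of EQC is that this identity is enforced architecturally for the actual multi-agent policy, rather than being checked after the fact.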
The paper first formalizes the problem setting of Dec-POMDPs and introduces relevant group-theoretic concepts like equivariance. It then presents the EQC architecture and discusses its mathematical properties. The authors propose two algorithmic approaches using EQC: 1) training agents with a group-based Other-Play learning rule, then symmetrizing them at test time, and 2) symmetrizing pre-trained agents at test time.
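As a rough illustration of the test-time route, the following hedged sketch shows one standard way to symmetrize a fixed pre-trained policy: average it over a finite symmetry group by applying each symmetry to the observation, running the policy, undoing the symmetry on the action distribution, and taking the mean. The names `base_policy`, `act_on_obs`, and `act_on_dist` are illustrative assumptions, not the paper's API.

```python
import numpy as np

def symmetrize(base_policy, group, act_on_obs, act_on_dist):
    """Group-average a pre-trained policy at test time (a sketch, not EQC itself).

    `group` is an iterable of (g, g_inverse) pairs; `act_on_obs(g, obs)` applies a
    symmetry to an observation, and `act_on_dist(g_inv, dist)` applies the inverse
    symmetry to the resulting action distribution (e.g. permutes action indices back).
    """
    def symmetric_policy(obs):
        transformed = [
            act_on_dist(g_inv, base_policy(act_on_obs(g, obs)))
            for g, g_inv in group
        ]
        # Uniform averaging over the group makes the combined policy symmetry-equivariant.
        return np.mean(transformed, axis=0)
    return symmetric_policy

# Hypothetical usage with colour-relabelling permutations:
# group = [(g, invert(g)) for g in all_colour_permutations]
# zsc_policy = symmetrize(pretrained_policy, group, permute_obs, permute_dist)
```

Group averaging of this kind is what lets the method act as a coordination-improvement operator on generic pre-trained policies: symmetry-equivalent situations are treated identically, so independently trained partners cannot settle on incompatible, symmetry-breaking conventions.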
The experiments on the Hanabi benchmark show that both approaches outperform prior symmetry-aware baselines in zero-shot coordination. The authors also demonstrate that EQC can be used as a coordination-improvement operator to enhance the performance of diverse pre-trained policies, including the state-of-the-art for zero-shot coordination on Hanabi.
Statistics
This summary does not reproduce explicit numerical statistics from the paper; the key results are reported as comparative performance between different agent types on the Hanabi benchmark.
Quotes
"Equivariant policies are such that symmetric changes to their observation cause a corresponding change to their output. In doing so, we fundamentally prevent the agent from breaking symmetries over the course of training."
"Our method also acts as a "coordination-improvement operator" for generic, pre-trained policies, and thus may be applied at test-time in conjunction with any self-play algorithm."
"We provide theoretical guarantees of our work and test on the AI benchmark task of Hanabi, where we demonstrate our methods outperforming other symmetry-aware baselines in zero-shot coordination, as well as able to improve the coordination ability of a variety of pre-trained policies."