Key Concepts
Data-Regularised Environment Design (DRED) generates new training levels using a generative model to approximate the context distribution, while employing adaptive level sampling to minimize the mutual information between the agent's internal representation and the training level identities. This enables DRED to achieve significant improvements in zero-shot transfer performance compared to existing adaptive sampling and unsupervised environment design methods.
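In generic notation (the paper's own symbols may differ), with B denoting the identity of the sampled training level and Z the agent's internal representation, the quantity being minimised is the mutual information between the two:

```latex
% Generic notation, not the paper's exact symbols: B is the training-level
% identity, Z the agent's internal representation of its observations.
% Driving I(B; Z) down encourages representations that do not encode which
% specific training level an observation came from.
I(B; Z) = H(B) - H(B \mid Z)
        = \mathbb{E}_{b, z}\!\left[ \log \frac{p(b, z)}{p(b)\, p(z)} \right]
```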
Summary
The paper investigates how the sampling of individual environment instances, or levels, affects the zero-shot generalization (ZSG) ability of reinforcement learning (RL) agents. The authors discover that, for deep actor-critic architectures sharing their base layers, prioritizing levels according to their value loss minimizes the mutual information between the agent's internal representation and the set of training levels. This provides a theoretical justification for the implicit regularization achieved by certain adaptive sampling strategies.
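As a rough illustration of how such a strategy can be implemented, the sketch below scores each training level by its latest value loss and samples levels by rank, in the spirit of prioritised level replay. The function names, the temperature `beta`, and the exact weighting are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def rank_prioritized_probs(value_losses, beta=0.3):
    """Turn per-level value-loss scores into a rank-based sampling distribution.

    Levels with larger value loss get higher probability, so training focuses
    on levels the critic currently fits worst. `beta` is an illustrative
    temperature: smaller values sharpen the distribution.
    """
    value_losses = np.asarray(value_losses, dtype=np.float64)
    order = np.argsort(-value_losses)                    # indices by descending loss
    ranks = np.empty(len(value_losses))
    ranks[order] = np.arange(1, len(value_losses) + 1)   # rank 1 = highest loss
    weights = (1.0 / ranks) ** (1.0 / beta)
    return weights / weights.sum()

def sample_level(rng, value_losses):
    """Sample the index of the next training level to play."""
    probs = rank_prioritized_probs(value_losses)
    return rng.choice(len(value_losses), p=probs)

# Toy usage: four training levels and the critic's latest value-loss scores.
rng = np.random.default_rng(0)
losses = [0.8, 0.1, 0.4, 0.05]
print(sample_level(rng, losses))  # index 0 (highest loss) is sampled most often
```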
The authors then turn their attention to unsupervised environment design (UED) methods, which have more control over the data generation mechanism. They find that existing UED methods can significantly shift the training distribution, which translates to low ZSG performance. To prevent both overfitting and distributional shift, the authors introduce Data-Regularised Environment Design (DRED). DRED generates levels using a generative model trained over an initial set of level parameters, reducing distributional shift, and achieves significant improvements in ZSG over adaptive level sampling strategies and UED methods.
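The sketch below captures the overall structure under simplifying assumptions: a stand-in density model (a diagonal Gaussian here, whereas DRED learns a proper generative model) is fitted to the initial level parameters, and each episode either replays a known level weighted by its value loss or generates a new level from the model. The class `GaussianLevelModel`, the mixing probability `p_generate`, and the proportional weighting are hypothetical placeholders, not DRED's actual interface.

```python
import numpy as np

class GaussianLevelModel:
    """Stand-in generative model fitted to the initial level parameters.

    DRED trains a learned generative model over level parameters; a diagonal
    Gaussian is used here only to keep the sketch self-contained.
    """

    def fit(self, level_params):
        params = np.asarray(level_params, dtype=np.float64)
        self.mean = params.mean(axis=0)
        self.std = params.std(axis=0) + 1e-6
        return self

    def sample(self, rng):
        # New level parameters stay close to the initial distribution,
        # limiting distributional shift in the generated training data.
        return rng.normal(self.mean, self.std)

def next_level(rng, model, known_levels, value_losses, p_generate=0.5):
    """Either generate a new level or replay a known one, weighted by value loss."""
    if rng.random() < p_generate:
        return model.sample(rng)                     # expand the training set
    losses = np.asarray(value_losses, dtype=np.float64)
    probs = losses / losses.sum()                    # simplified priority weighting
    return known_levels[rng.choice(len(known_levels), p=probs)]

# Toy usage with 2-dimensional level parameters.
rng = np.random.default_rng(0)
initial_levels = np.array([[0.0, 1.0], [0.2, 0.9], [0.1, 1.1]])
model = GaussianLevelModel().fit(initial_levels)
print(next_level(rng, model, initial_levels, value_losses=[0.3, 0.7, 0.1]))
```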
The key highlights and insights from the paper are:
- Adaptive sampling strategies like value loss prioritization can be viewed as implicit regularization techniques that minimize the mutual information between the agent's internal representation and the training level identities.
- Existing UED methods can cause significant distributional shift in the training data, leading to poor zero-shot performance.
- DRED combines adaptive sampling with a generative model of the context distribution to generate new training levels. This allows DRED to increase the diversity of the training set while maintaining consistency with the target task semantics, leading to strong zero-shot transfer performance.
- DRED outperforms both adaptive sampling and UED baselines, achieving up to 1.25 times the returns of the next best method on held-out levels, and 2-3 times higher performance on more difficult in-context edge cases.
Statistics
The agent (yellow) must navigate to the goal (green) while avoiding walls (grey) and only observing tiles directly adjacent to itself.
An agent trained over levels (a)-(c) will transfer zero-shot to level (d) if it has learned a behavior adapted to the task semantics of following blue tiles to the goal location.
Quotes
"Autonomous agents trained using deep reinforcement learning (RL) often lack the ability to successfully generalise to new environments, even when they share characteristics with the environments they have encountered during training."
"We discover that, for deep actor-critic architectures sharing their base layers, prioritising levels according to their value loss minimises the mutual information between the agent's internal representation and the set of training levels in the generated training data."
"To prevent both overfitting and distributional shift, we introduce data-regularised environment design (DRED). DRED generates levels using a generative model trained over an initial set of level parameters, reducing distributional shift, and achieves significant improvements in ZSG over adaptive level sampling strategies and UED methods."