Key Concepts
Exploration during training can improve generalization performance in reinforcement learning, even for states that cannot be reached from the training tasks (unreachable states).
Summary
This paper explores how exploration during training can be leveraged to improve generalization performance in reinforcement learning, particularly for states that cannot be reached from the training tasks (unreachable states).
The key insights are:
In the single-task reinforcement learning setting, it is sufficient to learn an optimal policy only on the states that the optimal policy itself visits. This logic does not transfer to the zero-shot policy transfer (ZSPT) setting, where the goal is to generalize to new tasks whose states the training-time optimal policy may never visit.
Recent work has shown that increased exploration during training can improve generalization performance in the ZSPT setting, particularly for tasks where the states encountered during testing are reachable from the training tasks (reachable generalization). The intuition is that by training on more of the reachable states, the agent has already learned how to act in many of the states it will encounter at test time.
The authors provide new intuition for why exploration can also benefit generalization to unreachable tasks, where the states encountered during testing cannot be reached from the training tasks. They argue that training on more reachable states acts as a form of implicit data augmentation, helping the agent learn features that are invariant to irrelevant task details (like background color); a toy illustration of this analogy follows the summary points below.
The authors propose a novel method called Explore-Go that increases the diversity of the agent's starting states by performing a pure exploration phase at the beginning of each episode. This broadens the distribution of states the agent trains on, which can improve both reachable and unreachable generalization; a minimal sketch of such a collection loop is given below.
The authors evaluate Explore-Go on an illustrative environment and the Procgen benchmark. On the illustrative environment, Explore-Go significantly outperforms a standard PPO baseline on unreachable generalization. On Procgen, the results are more mixed, with Explore-Go improving performance on some environments but not others. The authors hypothesize this is due to the pure exploration phase not being effective enough in the more complex Procgen environments.
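To make the implicit data augmentation analogy concrete, the toy sketch below shows what an explicit version of that augmentation could look like: randomising a task-irrelevant background colour while keeping the task-relevant pixels fixed. This is only an illustration of the intuition; Explore-Go does not perform this step, and the function name, the black-background convention, and the image layout are assumptions made for the example.

```python
import numpy as np

def randomize_background(obs, rng):
    """Toy stand-in for explicit data augmentation (not part of Explore-Go):
    recolour the task-irrelevant background of an image observation while
    leaving the task-relevant pixels untouched."""
    augmented = obs.copy()
    background = obs.sum(axis=-1) == 0        # assumption: black pixels are background
    augmented[background] = rng.integers(0, 256, size=3, dtype=obs.dtype)
    return augmented

# A white "agent" square on a black background; after augmentation only the
# background colour changes, so a feature extractor trained on many such
# variants is pushed towards background-invariant features.  The paper's
# argument is that visiting more reachable states exposes the agent to
# analogous, naturally occurring variations of irrelevant details.
rng = np.random.default_rng(0)
obs = np.zeros((64, 64, 3), dtype=np.uint8)
obs[28:36, 28:36] = 255
augmented = randomize_background(obs, rng)
```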
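Below is a minimal sketch of how an Explore-Go-style episode collection loop could look, assuming a classic gym-style environment (four-value step returns) and generic explore_policy / task_policy callables; both names are placeholders, and the length of the exploration phase and the handling of its transitions are simplifications rather than the paper's exact procedure.

```python
import random

def collect_episode_explore_go(env, explore_policy, task_policy, max_explore_steps=50):
    """Run one episode with a pure exploration phase before task-policy data
    collection, so the task policy effectively starts from a more diverse
    state than env.reset() alone would provide."""
    obs = env.reset()
    done = False

    # Pure exploration phase: wander for a random number of steps.
    for _ in range(random.randint(0, max_explore_steps)):
        obs, _, done, _ = env.step(explore_policy(obs))
        if done:                             # episode ended mid-exploration:
            obs, done = env.reset(), False   # simply begin a fresh episode

    # Task phase: act with the task policy and record transitions for training.
    trajectory = []
    while not done:
        action = task_policy(obs)
        next_obs, reward, done, _ = env.step(action)
        trajectory.append((obs, action, reward, next_obs, done))
        obs = next_obs
    return trajectory
```

Starting the task policy from these post-exploration states is what broadens the distribution of states it trains on, which is the effect the authors credit for the gains in both reachable and unreachable generalization.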
Statistics
The paper does not report headline numerical metrics to support its main claims; the results are presented qualitatively through plots.