Core Concepts
Optimizing the expected value of non-ergodic rewards can lead to policies that receive exceptionally high returns with probability zero but almost surely result in catastrophic outcomes. Transforming the returns to have ergodic increments enables learning robust policies by optimizing the long-term return for individual agents rather than the average across infinitely many trajectories.
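The divergence between the ensemble average and what an individual agent experiences is easiest to see in the multiplicative coin toss from ergodicity economics. The sketch below is illustrative and not an experiment from the paper: each step multiplies wealth by 1.5 or 0.6 with equal probability, so the expected per-step growth factor is 1.05, yet the time-average growth rate is E[log factor] ≈ -0.053, and almost every individual trajectory decays.

```python
import numpy as np

rng = np.random.default_rng(0)
n_agents, n_steps = 100_000, 50

# Each step multiplies wealth by 1.5 or 0.6 with equal probability,
# so the expected per-step growth factor is 0.5 * (1.5 + 0.6) = 1.05.
factors = rng.choice([1.5, 0.6], size=(n_agents, n_steps))
final_wealth = factors.prod(axis=1)

# The ensemble average grows like 1.05**t; the empirical mean is noisy
# because it is dominated by a handful of astronomically lucky runs.
print("analytic ensemble mean :", 1.05 ** n_steps)  # ~11.5, growing in t
print("empirical ensemble mean:", final_wealth.mean())

# Almost every individual agent decays: the time-average growth rate is
# E[log factor] = 0.5 * (log 1.5 + log 0.6) ~ -0.053 < 0.
print("median final wealth    :", np.median(final_wealth))      # ~0.07
print("fraction below start   :", (final_wealth < 1.0).mean())  # ~0.79
```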
Abstract
The paper discusses the impact of ergodicity on the choice of optimization criterion in reinforcement learning (RL). When rewards are non-ergodic, optimizing the expected return, as conventional RL algorithms do, yields policies that are not robust.
The key insights are:
Non-ergodicity: In non-ergodic settings, the average over many trajectories (the expected value) differs from the time average along one long trajectory. Policies can therefore perform exceptionally well on average yet fail catastrophically in individual realizations, as in the coin-toss sketch above.
Ergodicity transformation: Instead of changing the objective function, one can find a transformation that converts the time series of returns into one with ergodic increments. Optimizing the expected value of these ergodic increments is then equivalent to maximizing the long-term growth rate of the return for an individual agent (see the sketch after this list).
Learning the transformation: The paper proposes a method for learning an ergodicity transformation directly from data, without requiring analytical expressions for the environment dynamics (a toy illustration follows the concluding paragraph below).
Relation to risk-sensitive RL: The authors analyze how transformations used in risk-sensitive RL can be motivated from an ergodicity perspective, providing a theoretical foundation for these approaches.
Experiments: The authors demonstrate the effectiveness of the proposed transformation on standard RL benchmark environments, showing that it can lead to more robust policies compared to standard RL algorithms.
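For purely multiplicative dynamics like the coin toss above, the ergodicity transformation is known in closed form: the logarithm. The sketch below assumes those dynamics and only illustrates the concept; the paper's contribution is learning a suitable transformation when no closed form is available.

```python
import numpy as np

rng = np.random.default_rng(1)
n_steps = 100_000

# One long trajectory of the multiplicative coin-toss process. Working in
# the log domain (cumsum of logs) avoids numerical underflow.
factors = rng.choice([1.5, 0.6], size=n_steps)
log_wealth = np.cumsum(np.log(factors))

# Raw increments wealth[t+1] - wealth[t] scale with the current wealth, so
# their statistics drift over time: they are not ergodic. The transformed
# increments u(w[t+1]) - u(w[t]) with u = log reduce to log(factors), which
# are i.i.d. and hence ergodic.
transformed_increments = np.diff(log_wealth)

# Their expected value equals the time-average growth rate an individual
# agent actually experiences in the long run.
print("mean transformed increment:", transformed_increments.mean())
print("analytic E[log factor]    :", 0.5 * (np.log(1.5) + np.log(0.6)))  # ~ -0.053
```

Maximizing the expected transformed increment therefore optimizes the growth rate along a single trajectory rather than the ensemble average, which is the equivalence the paper exploits.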
The paper highlights the importance of considering ergodicity when designing RL algorithms, especially in domains where the consequences of failure can be catastrophic. The proposed approach provides a principled way to learn policies that optimize the long-term performance of individual agents rather than the average across many trajectories.
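The paper's learning procedure itself is not reproduced here. As a toy stand-in for the idea of extracting a transformation from data, the sketch below scores a few hypothetical candidate transformations by how stationary they make the increments of a single trajectory and selects the best; the candidate menu, the stationarity score, and the dynamics are all illustrative assumptions, not the paper's algorithm.

```python
import numpy as np

def nonstationarity(u, trajectory):
    # Heuristic score: ergodic increments should look statistically alike
    # early and late along one trajectory. Compare the increment variance
    # of the first and second halves (0 = perfectly stationary).
    inc = np.diff(u(trajectory))
    half = len(inc) // 2
    return abs(np.log(inc[:half].var() / inc[half:].var()))

rng = np.random.default_rng(2)

# One long trajectory of the multiplicative coin-toss process.
factors = rng.choice([1.5, 0.6], size=2_000)
trajectory = np.exp(np.cumsum(np.log(factors)))

# Illustrative candidate family; the paper learns the transformation
# rather than selecting it from a fixed menu.
candidates = {"identity": lambda x: x, "sqrt": np.sqrt, "log": np.log}

scores = {name: nonstationarity(u, trajectory) for name, u in candidates.items()}
print(scores)  # log should score best: its increments are i.i.d.
print("selected transformation:", min(scores, key=scores.get))
```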
Stats
The paper does not contain key metrics or important figures supporting the author's main arguments. The analysis is primarily conceptual, focusing on the theoretical implications of non-ergodicity in reinforcement learning.
Quotes
The paper does not contain any striking quotes supporting the author's main arguments.