Core Concepts
Proximal Policy Optimization (PPO) agents suffer from deteriorating representations, leading to a performance collapse that is irrecoverable because the agent also loses plasticity.
Abstract
The study examines the representation dynamics of Proximal Policy Optimization (PPO) agents in the Arcade Learning Environment (ALE) and MuJoCo environments. The key findings are:
PPO agents exhibit a consistent increase in the norm of the pre-activations of the policy network's feature layer, accompanied by a decline in the feature rank over time. This representation collapse is observed in both the ALE and MuJoCo, across different model architectures and activation functions (a monitoring sketch for these two quantities appears after this list).
Increasing the number of optimization epochs per rollout, which amplifies the non-stationarity in PPO, accelerates the growth of the pre-activation norm and the collapse of the feature rank. This ultimately leads to a collapse in the policy's performance in some environments.
The representation collapse undermines the effectiveness of PPO's clipping mechanism: the trust-region-like constraint becomes unreliable when the representations are poor. This creates a snowball effect, where representation degradation prevents the agent from improving its policy through the PPO objective (the standard clipped objective is reproduced after this list for reference).
The performance collapse is irrecoverable due to a significant increase in the plasticity loss, indicating a loss of the agent's ability to fit new targets and adapt to changes.
Interventions that regularize the representation dynamics, such as Proximal Feature Optimization (PFO), which penalizes changes in the pre-activations, can mitigate the representation collapse and improve the agent's performance (a PFO-style sketch follows this list). Sharing the feature trunk between the actor and critic also shows promise, but its effectiveness depends on the reward sparsity of the environment.
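The two quantities tracked throughout the study can be monitored with a few lines of PyTorch. The sketch below is illustrative, assuming a policy split into a trunk and a linear feature head; the module names and the singular-value threshold used for the rank estimate are assumptions, not the authors' code.

```python
# Hypothetical monitoring sketch (PyTorch): per-batch pre-activation norm and an
# effective-rank estimate of the policy's feature layer.
import torch

@torch.no_grad()
def representation_metrics(policy_trunk, feature_head, obs_batch, svd_threshold=0.01):
    hidden = policy_trunk(obs_batch)            # activations feeding the feature layer
    pre_acts = feature_head(hidden)             # pre-activations of the feature layer
    norm = pre_acts.norm(dim=1).mean().item()   # mean per-sample L2 norm

    # Approximate feature rank: number of singular values above a fraction of the
    # largest one (one common convention; the paper may use a different estimator).
    s = torch.linalg.svdvals(pre_acts)
    rank = int((s > svd_threshold * s[0]).sum().item())
    return {"pre_activation_norm": norm, "feature_rank": rank}
```

For reference, the clipping mechanism discussed above is the standard PPO clipped surrogate objective (usual notation, not copied from the paper):

$$
L^{\text{CLIP}}(\theta) = \mathbb{E}_t\!\left[\min\!\big(r_t(\theta)\,\hat{A}_t,\ \operatorname{clip}(r_t(\theta),\,1-\epsilon,\,1+\epsilon)\,\hat{A}_t\big)\right],
\qquad r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)}
$$

When the features behind $\pi_\theta$ degrade, the probability ratios $r_t(\theta)$ that the clipping acts on become unreliable, which is the trust-region failure described above.

Finally, a minimal sketch of a PFO-style auxiliary term, assuming the pre-activations of the rollout-time (old) policy are cached alongside the trajectories; the function structure, names, and coefficient are illustrative rather than the authors' implementation.

```python
# Minimal sketch of a PFO-style auxiliary loss (assumed structure, not the authors' code):
# penalize how far the feature layer's pre-activations drift from their values under the
# policy that collected the rollout, mirroring PPO's proximal spirit at the feature level.
import torch
import torch.nn.functional as F

def pfo_penalty(policy_trunk, feature_head, obs_batch, old_pre_acts, coef=1.0):
    # Recompute pre-activations with the current parameters.
    pre_acts = feature_head(policy_trunk(obs_batch))
    # Squared-error drift from the cached (old-policy) pre-activations.
    return coef * F.mse_loss(pre_acts, old_pre_acts.detach())

# Usage (hypothetical): add the penalty to the usual PPO loss, e.g.
# total_loss = ppo_clip_loss + value_coef * value_loss - entropy_coef * entropy \
#     + pfo_penalty(policy_trunk, feature_head, obs_batch, cached_pre_acts)
```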
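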
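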
Overall, the study highlights the importance of monitoring representation dynamics in policy optimization methods like PPO, as representation collapse can lead to trust region issues and unrecoverable performance decline.
Stats
The norm of the pre-activations of the policy network's feature layer consistently increases over training.
The feature rank of the policy network declines over training, eventually collapsing in some environments.
The plasticity loss of the policy network increases significantly around the time of performance collapse.
Quotes
"We observe a consistent increase in the norm of the pre-activations of the feature layer of the policy network."
"We observe a rank decline in five out of six ALE games and seven out of eight MuJoCo tasks."
"The rank eventually collapses, which coincides with a collapse in the policy's performance."