Core Concepts
Policy-guided diffusion generates synthetic trajectories whose actions balance likelihood under both the behavior and target policies, yielding plausible trajectories with high target-policy probability and low dynamics error.
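A minimal sketch of this trade-off as a formula (notation assumed here, not taken verbatim from the paper): with p_psi the trajectory distribution learned by the diffusion model on the behavior data, pi the target policy, and lambda a guidance coefficient, guided sampling approximately draws trajectories from

```latex
% Behavior-regularized target distribution (illustrative sketch; symbols assumed):
%   p_psi  - trajectory distribution learned by the diffusion model on offline (behavior) data
%   pi     - target policy being trained
%   lambda - guidance coefficient trading target-policy likelihood against behavior support
p_{\mathrm{PGD}}(\tau) \;\propto\; p_\psi(\tau) \prod_{t=0}^{T-1} \pi(a_t \mid s_t)^{\lambda}
```

Larger lambda pushes sampled actions toward the target policy, while lambda = 0 recovers unguided sampling from the behavior distribution.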
Abstract
The content discusses a method called policy-guided diffusion (PGD) for generating synthetic training data in offline reinforcement learning (RL) settings.
The key insights are:
- Offline RL suffers from distribution shift between the behavior policy (which collected the offline data) and the target policy being trained. This creates an out-of-sample issue: the target policy selects actions in regions underrepresented in the offline data, where value estimates are unreliable.
- Prior work has proposed using autoregressive world models to generate synthetic on-policy experience. However, these models suffer from compounding error, forcing short rollouts that limit coverage.
- PGD instead models entire trajectories with a diffusion model, which avoids compounding error across timesteps. During sampling, it applies guidance from the target policy to shift the sampling distribution toward actions with high likelihood under the target policy (see the sketch after this list).
- This yields a "behavior-regularized target distribution" that balances action likelihoods under the behavior and target policies, retaining the low dynamics error of diffusion while generating trajectories more representative of the target policy.
- Experiments show that agents trained on PGD-generated synthetic data outperform those trained on real or unguided synthetic data, across a range of environments and behavior policies. PGD also achieves lower dynamics error than prior autoregressive world model approaches.
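To make the guidance step concrete, here is a minimal PyTorch sketch of one policy-guided reverse-diffusion update. The interfaces are assumptions for illustration (a score-based trajectory model `score_model`, a differentiable target policy exposing `policy.log_prob(actions, states)`, and DDPM-style coefficients `alpha`/`sigma`); it is not the authors' implementation.

```python
# Sketch of one policy-guided reverse-diffusion step (assumed interfaces, not the paper's code).
# The trajectory tensor concatenates states and actions per timestep; guidance adds the
# gradient of the target policy's log-likelihood to the action dimensions of the score.
import torch

def guided_denoise_step(traj, t, score_model, policy, alpha, sigma,
                        state_dim, guidance_coef=1.0):
    """traj: noisy trajectory, shape (batch, horizon, state_dim + action_dim).
    alpha[t] = sqrt(alpha_t) and sigma[t] = sqrt(beta_t) in the usual DDPM notation."""
    states = traj[..., :state_dim]

    # Score of the trajectory diffusion model trained on the offline (behavior) data.
    base_score = score_model(traj, t)

    # Gradient of the target policy's log-likelihood w.r.t. the noisy actions.
    actions = traj[..., state_dim:].detach().requires_grad_(True)
    log_prob = policy.log_prob(actions, states).sum()
    policy_grad = torch.autograd.grad(log_prob, actions)[0]

    # Guidance only nudges the action dimensions; states are left to the base score.
    guided_score = base_score.clone()
    guided_score[..., state_dim:] += guidance_coef * policy_grad

    # Standard ancestral sampling update, now using the guided score.
    noise = torch.randn_like(traj) if t > 0 else torch.zeros_like(traj)
    return (traj + sigma[t] ** 2 * guided_score) / alpha[t] + sigma[t] * noise
```

Sweeping guidance_coef trades off how strongly sampled actions follow the target policy against how far trajectories drift from the support of the behavior data.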
Stats
The content does not provide any specific numerical data or metrics. It focuses on describing the policy-guided diffusion method and comparing it qualitatively to prior approaches.
Quotes
The content contains no direct quotes that are particularly striking or that directly support the key arguments.