Key Concepts
A dual generative replay framework that leverages diffusion models to model states and behaviors with high fidelity, allowing the continual learning policy to inherit their distributional expressivity and mitigate catastrophic forgetting.
Summary
The paper proposes CuGRO (Continual learning via diffusion-based dual Generative Replay for Offline RL), an efficient method for addressing the challenges of continual offline reinforcement learning (CORL).
Key highlights:
- CuGRO decouples the continual learning policy into an expressive generative behavior model and an action evaluation model, allowing the policy to inherit distributional expressivity and thus encompass a progressively growing range of diverse behaviors.
- CuGRO introduces a state generative model to mimic previous state distributions conditioned on task identity, and pairs the generated states with corresponding responses from the behavior generative model to represent old tasks.
- CuGRO leverages diffusion probabilistic models to model states and corresponding behaviors with high fidelity, enabling efficient generative replay without storing past samples.
- CuGRO interleaves replayed samples with real ones from the new task to continually update the state and behavior generators, and uses a multi-head critic with behavior cloning to mitigate forgetting (see the sketch after this list).
- Experiments demonstrate that CuGRO achieves better forward transfer with less forgetting, and closely approximates the results of using previous ground-truth data due to its high-fidelity replay of the sample space.
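To make the replay mechanics concrete, below is a minimal runnable sketch of diffusion-based generative replay for the state model only. It is an illustration under stated assumptions, not CuGRO's implementation: a small MLP noise predictor stands in for the paper's conditional diffusion network, the "datasets" are synthetic Gaussians, and every name (`NoisePredictor`, `sample_states`, `train_step`) is hypothetical. The behavior generator and multi-head critic are omitted.

```python
# Minimal sketch of diffusion-based generative replay for the state model.
# Assumptions (not from the paper): toy MLP noise predictor, synthetic
# Gaussian datasets, 100 diffusion steps; all names are hypothetical.
import copy
import torch
import torch.nn as nn

T = 100                                   # number of diffusion steps
betas = torch.linspace(1e-4, 0.02, T)     # standard DDPM noise schedule
alphas = 1.0 - betas
alpha_bar = torch.cumprod(alphas, dim=0)

class NoisePredictor(nn.Module):
    """eps_theta(x_t, t, task): predicts the noise injected into a state,
    conditioned on the diffusion step and a task-identity one-hot."""
    def __init__(self, state_dim, num_tasks):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1 + num_tasks, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, state_dim))

    def forward(self, x_t, t, task):
        t_emb = t.float().unsqueeze(-1) / T  # crude scalar timestep embedding
        return self.net(torch.cat([x_t, t_emb, task], dim=-1))

def onehot(j, num_tasks, n):
    """n copies of the one-hot identity vector for task j."""
    t = torch.zeros(n, num_tasks)
    t[:, j] = 1.0
    return t

def diffusion_loss(model, x0, task):
    """Standard DDPM objective: noise x0 to a random step t, predict the noise."""
    t = torch.randint(0, T, (x0.shape[0],))
    eps = torch.randn_like(x0)
    ab = alpha_bar[t].unsqueeze(-1)
    x_t = ab.sqrt() * x0 + (1.0 - ab).sqrt() * eps
    return ((model(x_t, t, task) - eps) ** 2).mean()

@torch.no_grad()
def sample_states(model, n, task, state_dim):
    """Ancestral DDPM sampling: draws replay states for the given task."""
    x = torch.randn(n, state_dim)
    for i in reversed(range(T)):
        t = torch.full((n,), i)
        eps_hat = model(x, t, task)
        x = (x - (1.0 - alphas[i]) / (1.0 - alpha_bar[i]).sqrt() * eps_hat) / alphas[i].sqrt()
        if i > 0:
            x = x + betas[i].sqrt() * torch.randn_like(x)
    return x

def train_step(model, opt, new_states, k, num_tasks, prev_model=None):
    """One update on task k, interleaving real new-task states with replay
    states generated for every earlier task by the frozen previous model."""
    batches = [(new_states, onehot(k, num_tasks, len(new_states)))]
    if prev_model is not None:
        for j in range(k):
            task_j = onehot(j, num_tasks, len(new_states))
            replay_j = sample_states(prev_model, len(new_states), task_j, new_states.shape[-1])
            batches.append((replay_j, task_j))
    loss = sum(diffusion_loss(model, s, task) for s, task in batches) / len(batches)
    opt.zero_grad()
    loss.backward()
    opt.step()

# Usage: learn task 0, freeze a snapshot, then learn task 1 with replay.
state_dim, num_tasks = 4, 2
model = NoisePredictor(state_dim, num_tasks)
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for _ in range(200):
    train_step(model, opt, torch.randn(64, state_dim) + 2.0, k=0, num_tasks=num_tasks)
frozen = copy.deepcopy(model).eval()
for _ in range(200):
    train_step(model, opt, torch.randn(64, state_dim) - 2.0, k=1, num_tasks=num_tasks, prev_model=frozen)
```

In CuGRO the behavior model is trained the same way but conditioned on states rather than task identity, and the replayed state-action pairs also feed the new critic head; the point the sketch shows is that only generated samples, never stored ones, represent old tasks.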
Statistics
"We assume that the task follows a distribution Mk = (S, A, P, r, γ)k ∼P(M). The learner is presented with an infinite sequence of tasks [M1, ..., Mk, ...], and for each task Mk, an offline dataset Dµk = P
i(si, ai, ri, s′
i)k is collected by a behavior policy µk."
"The objective for CORL is to learn a continual policy that maximizes the expected return over all encountered tasks as J(πcontinual) = P
K
k=1 JMk(πMk)."
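For reference, the quoted setup and objective render as follows in standard notation; the expansion of J_{M_k} as a discounted return is an assumption (it is the usual definition, but the quote does not spell it out):

```latex
% Task distribution and per-task offline dataset, as quoted:
M_k = (\mathcal{S}, \mathcal{A}, P, r, \gamma)_k \sim P(M), \qquad
\mathcal{D}_{\mu_k} = \textstyle\sum_i (s_i, a_i, r_i, s'_i)_k

% CORL objective; J_{M_k} assumed to be the standard discounted return:
J(\pi_{\mathrm{continual}}) = \sum_{k=1}^{K} J_{M_k}(\pi_{M_k}), \qquad
J_{M_k}(\pi) = \mathbb{E}_{\pi}\!\left[ \sum_{t=0}^{\infty} \gamma^{t}\, r_k(s_t, a_t) \right]
```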
Quotes
"To the best of our knowledge, CuGRO is the first that leverages expressive diffusion models to tackle the understudied CORL challenge."
"Diffusion probabilistic models [27, 30] are utilized to fit the behavior distribution from the offline dataset."
"We leverage existing advances in diffusion probabilistic models [27] to model states and corresponding behaviors with high fidelity, allowing the continual policy to inherit the distributional expressivity."