The paper introduces a new setting called Continual Offline Reinforcement Learning (CORL), in which an agent learns a sequence of offline reinforcement learning tasks and aims to maintain good performance on all of them using only a small replay buffer, without any online interaction with the environments.
To address the key challenges of CORL, the paper proposes two components:
Model-Based Experience Selection (MBES): This scheme uses a learned transition model to generate rollouts of the learned policy and selects the offline-dataset experiences that best match those rollouts to store in the replay buffer. This narrows the distribution shift between the replay buffer and the learned policy.
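The paper's exact selection rule is not reproduced here; the following is a minimal sketch of the idea under stated assumptions: a learned dynamics model predicts the next state the current policy would reach from each dataset state, and transitions whose recorded next state is closest to that prediction are kept. The names `select_experiences`, `policy.act`, and `dynamics_model.predict` are hypothetical, not the paper's API.

```python
import numpy as np

def select_experiences(dataset, policy, dynamics_model, buffer_size):
    """Hypothetical sketch of model-based experience selection.

    For each transition (s, a, r, s') in the offline dataset, query the
    learned policy at s, predict the next state with the learned dynamics
    model, and score the transition by how close the dataset's next state
    is to that prediction. Keep the top-scoring transitions for the buffer.
    """
    scores = []
    for (s, a, r, s_next) in dataset:
        a_pi = policy.act(s)                      # action the learned policy would take at s
        s_pred = dynamics_model.predict(s, a_pi)  # model-predicted next state under that action
        # Smaller distance => transition is more consistent with the learned policy
        scores.append(-np.linalg.norm(s_next - s_pred))
    top_idx = np.argsort(scores)[-buffer_size:]
    return [dataset[i] for i in top_idx]
```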
Dual Behavior Cloning (DBC): This architecture separates policy optimization on the current task from behavior cloning of previous tasks, resolving the inconsistency between learning the new task and cloning old tasks, an issue unique to the CORL setting.
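As a rough illustration of how such a separation could look, here is a hedged sketch assuming a PyTorch setup with a current-task policy network, a separate cloning network, and a generic actor-critic style offline objective; the network, optimizer, and batch-key names, as well as the `alpha` weight, are assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def dual_behavior_cloning_step(policy_net, clone_net, critic,
                               current_batch, replay_batches,
                               policy_opt, clone_opt, alpha=1.0):
    """Hypothetical sketch of a dual-behavior-cloning update.

    policy_net is optimized only on the current task, while a separate
    clone_net distills the current policy and imitates the replay buffers
    of previous tasks, so new-task learning and old-task cloning never
    compete inside a single objective.
    """
    s = current_batch["obs"]

    # 1) Current-task policy improvement (no old-task terms in this loss).
    policy_loss = -critic(s, policy_net(s)).mean()
    policy_opt.zero_grad()
    policy_loss.backward()
    policy_opt.step()

    # 2) Separate cloning network: distill the (detached) current policy ...
    clone_loss = F.mse_loss(clone_net(s), policy_net(s).detach())
    # ... and clone the selected experiences stored for each previous task.
    for batch in replay_batches:
        s_old, a_old = batch["obs"], batch["actions"]
        clone_loss = clone_loss + alpha * F.mse_loss(clone_net(s_old), a_old)
    clone_opt.zero_grad()
    clone_loss.backward()
    clone_opt.step()

    return policy_loss.item(), clone_loss.item()
```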
The paper experimentally verifies the effectiveness of MBES and DBC and shows that the overall method, OER (Offline Experience Replay), outperforms state-of-the-art baselines on various continuous control tasks.