Key Concepts
The core contribution of this paper is the ACE planner, an RL algorithm that incorporates a predictive model and off-policy learning elements to achieve sample-efficient exploration in continuous control tasks. The ACE planner leverages a novelty-aware terminal value function and an exponentially re-weighted version of the model-based value expansion (MVE) target to boost exploration and accelerate credit assignment.
Summary
The paper presents the ACE planner, a model-based RL algorithm that combines online planning and off-policy agent learning to address the sample efficiency problem in continuous control tasks. The key highlights and insights are:
- The ACE planner incorporates a predictive model and off-policy learning elements; an online planner, enhanced by a novelty-aware terminal value function, is employed for sample collection (see the planner sketch after this list).
- The authors derive an intrinsic reward from the forward predictive error within a latent state space; this error is closely tied to model uncertainty and helps the agent close the asymptotic performance gap (see the intrinsic-reward sketch after this list).
- The paper introduces an exponentially re-weighted version of the MVE target, which accelerates credit assignment during off-policy agent learning (see the target sketch after this list).
- Theoretical and experimental validation demonstrates the benefits of the novelty-aware value function, learned using the intrinsic reward, for online planning.
- The ACE planner consistently outperforms existing state-of-the-art algorithms, showing superior asymptotic performance across diverse environments with dense rewards. It also excels in solving hard exploration problems and remains competitive even in comparison to an oracle agent with access to the shaped reward function.
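As a concrete illustration of the online planning step, the sketch below scores one candidate action sequence with a learned model plus a discounted terminal value bootstrap. The function names, signatures, and the omission of the sampling-based outer loop (e.g., CEM or MPPI) are illustrative assumptions, not the paper's exact implementation:

```python
import torch

def plan_score(dynamics, reward_fn, value_fn, z0, actions, gamma: float = 0.99):
    """Score one candidate action sequence: model-predicted rewards over the
    planning horizon plus a discounted terminal value bootstrap. In ACE,
    `value_fn` would be the novelty-aware value function; all names here
    are assumed for illustration."""
    z, ret, discount = z0, torch.zeros(()), 1.0
    for a in actions:                       # actions: [H, action_dim]
        ret = ret + discount * reward_fn(z, a)
        z = dynamics(z, a)                  # one-step latent prediction
        discount *= gamma
    return ret + discount * value_fn(z)     # novelty-aware terminal bootstrap
```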
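The intrinsic reward can be sketched as the one-step forward predictive error of a latent dynamics model. The encoder, the MLP architecture, and the squared-error form below are assumptions for illustration; the paper defines its own latent model and error measure:

```python
import torch
import torch.nn as nn

class LatentDynamics(nn.Module):
    """Assumed one-step latent forward model f(z_t, a_t) -> z_{t+1}."""
    def __init__(self, latent_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim + action_dim, hidden), nn.ELU(),
            nn.Linear(hidden, latent_dim),
        )

    def forward(self, z, a):
        return self.net(torch.cat([z, a], dim=-1))

@torch.no_grad()
def intrinsic_reward(encoder, dynamics, obs, action, next_obs):
    """Novelty bonus: forward predictive error ||f(z_t, a_t) - z_{t+1}||^2
    in latent space, serving as a proxy for model uncertainty."""
    z_pred = dynamics(encoder(obs), action)
    z_next = encoder(next_obs)
    return ((z_pred - z_next) ** 2).mean(dim=-1)
```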
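Although the paper's exact weighting scheme is not reproduced here, an exponentially re-weighted MVE target can be written in the familiar TD(lambda) form, mixing the bootstrapped targets of an H-step model rollout with geometrically decaying weights. The recursion and the (gamma, lam) defaults below are assumptions:

```python
def reweighted_mve_target(rewards, values, gamma: float = 0.99, lam: float = 0.95):
    """Lambda-style mixture of n-step MVE targets over an H-step rollout.

    rewards: predicted rewards r_0..r_{H-1} along the model rollout
    values:  value estimates V(s_1)..V(s_H) at the successor states
    Returns the exponentially re-weighted target for s_0.
    """
    target = values[-1]                     # pure H-step bootstrap
    for t in reversed(range(len(rewards))):
        # Mix the one-step bootstrap V(s_{t+1}) with the longer-horizon target.
        target = rewards[t] + gamma * ((1 - lam) * values[t] + lam * target)
    return target
```

Using such a target for the critic lets reward information from deep in the rollout propagate in a single update, which is the accelerated credit assignment the authors highlight.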
Statistics
The paper does not report specific numerical metrics to support its key claims; results are presented as performance curves and aggregated benchmark scores.
Quotes
There are no direct quotes from the paper that support the key claims.