EfficientZero V2 (EZ-V2) introduces a general framework for sample-efficient reinforcement learning that extends strong performance to a variety of domains. The algorithm outperforms prior methods in 50 of 66 evaluated tasks across benchmarks including Atari 100k, Proprio Control, and Vision Control. Earlier algorithms aimed at improving sample efficiency have not consistently achieved top performance across multiple domains; EZ-V2 addresses the challenge of reaching high-level performance with limited data through several key algorithmic enhancements, notably a sample-based tree search for action planning and a search-based value estimation strategy.

The method learns a predictive model in a latent space and plans over actions using this model. Training combines supervised learning with a temporal-consistency objective that strengthens the supervision between predicted and true latent states. For policy improvement, particularly in continuous control, EZ-V2 introduces a sampling-based Gumbel search: target policies are obtained from the tree search, and the policy network is then trained on these targets with supervised learning.

Finally, EZ-V2 proposes Search-Based Value Estimation (SVE), which makes better use of off-policy data by generating imagined trajectories and using them for root value estimation.
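The temporal-consistency objective described above can be illustrated with a minimal sketch: a negative-cosine-similarity loss between the latent state predicted by the dynamics model and the latent produced by encoding the true next observation. This is the common SimSiam-style form used in this family of methods; the function name and the exact architecture details here are illustrative, not the paper's implementation.

```python
import numpy as np

def temporal_consistency_loss(predicted_latent, target_latent):
    """Negative cosine similarity between predicted and target latents.

    A SimSiam-style self-supervised consistency objective; the actual
    EZ-V2 loss may add projection/prediction heads and stop-gradients.
    """
    p = predicted_latent / (np.linalg.norm(predicted_latent) + 1e-8)
    t = target_latent / (np.linalg.norm(target_latent) + 1e-8)
    return -float(np.dot(p, t))

# A perfect prediction drives the loss to its minimum of -1.0
z = np.array([0.5, -1.0, 2.0])
print(round(temporal_consistency_loss(z, z), 6))  # → -1.0
```

Minimizing this loss pushes the unrolled latent trajectory toward the encoder's latents, which is what "strengthening the supervision between predicted and true states" amounts to.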
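Two mechanisms underlie Gumbel-style search at the root: sampling a small set of candidate actions without replacement via the Gumbel-Top-k trick, and allocating the simulation budget among them with sequential halving. The sketch below shows both mechanisms in isolation, under assumed simplifications (scalar candidates, a generic `q_fn` evaluator); it is not EZ-V2's full search, which operates on actions sampled from a learned continuous policy inside an MCTS tree.

```python
import numpy as np

def gumbel_top_k(logits, k, rng):
    """Gumbel-Top-k trick: adding i.i.d. Gumbel noise to logits and
    taking the top-k indices samples k items without replacement
    from the corresponding softmax distribution."""
    g = rng.gumbel(size=len(logits))
    return np.argsort(logits + g)[::-1][:k]

def sequential_halving(candidates, q_fn, budget):
    """Repeatedly halve the candidate set, spending the simulation
    budget on the survivors, and return the best remaining candidate."""
    candidates = list(candidates)
    while len(candidates) > 1:
        sims = max(1, budget // len(candidates))
        scores = [np.mean([q_fn(c) for _ in range(sims)]) for c in candidates]
        order = np.argsort(scores)[::-1]
        candidates = [candidates[i] for i in order[: max(1, len(candidates) // 2)]]
    return candidates[0]
```

In the continuous-control setting, the candidates would be actions drawn from the policy network, and `q_fn` would be a value estimate obtained by unrolling the learned model.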
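The core arithmetic behind value targets computed from imagined trajectories is a discounted n-step return bootstrapped at the model's horizon. The sketch below is a simplified stand-in for SVE (which averages search-derived root values over model-imagined states), kept here only to make the "imagined trajectory → value target" idea concrete; the function name and discount value are assumptions.

```python
def search_based_value(rewards, bootstrap_value, gamma=0.997):
    """Discounted sum of imagined per-step rewards plus a discounted
    bootstrap value at the horizon. A simplified stand-in for the
    value targets SVE builds from model-imagined trajectories."""
    value = bootstrap_value
    for r in reversed(rewards):
        value = r + gamma * value
    return value

# Two imagined steps with reward 1.0 each, bootstrapping from 10.0:
# target = 1.0 + 0.997 * (1.0 + 0.997 * 10.0)
print(search_based_value([1.0, 1.0], 10.0))
```

Because the trajectory is imagined by the current model rather than replayed verbatim, such targets stay consistent with the current policy even on stale off-policy data, which is the motivation the summary gives for SVE.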
Source: Shengjie Wan..., arxiv.org, 03-04-2024
https://arxiv.org/pdf/2403.00564.pdf