Core Concepts
This paper proposes a novel strategy that employs diverse randomized value functions to estimate the posterior distribution of Q-values. The posterior provides robust uncertainty quantification and yields lower confidence bounds (LCBs) on Q-values; by applying moderate value penalties to out-of-distribution (OOD) actions, the method achieves provable pessimism.
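To make the LCB mechanism concrete, here is a minimal sketch in PyTorch. It assumes `q_samples` holds Q-values drawn from the approximate posterior (e.g., from the ensemble BNN) for a batch of state-action pairs; the function name and the penalty coefficient `beta` are illustrative, not taken from the paper.

```python
import torch

def lcb_from_posterior(q_samples: torch.Tensor, beta: float = 1.0) -> torch.Tensor:
    """Pessimistic Q-value estimate from posterior samples.

    q_samples: shape (num_samples, batch_size), Q-values drawn from the
    approximate posterior for a batch of (state, action) pairs.
    beta: coefficient scaling the uncertainty penalty (an assumed knob).
    """
    mean = q_samples.mean(dim=0)  # posterior mean of Q
    std = q_samples.std(dim=0)    # posterior standard deviation (uncertainty)
    return mean - beta * std      # lower confidence bound: larger std, larger penalty
```

Actions poorly covered by the dataset produce higher posterior spread, so their LCB targets are penalized more heavily, which is the pessimism mechanism the summary describes.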
Summary
The paper introduces a novel strategy called Diverse Randomized Value Functions (DRVF) for offline reinforcement learning. The key highlights are:
- DRVF employs Bayesian neural networks (BNNs) to approximate the Bayesian posterior of the value function, which provides robust uncertainty quantification and estimates the lower confidence bound (LCB) of Q-values.
- The paper introduces a repulsive regularization term that enhances diversity among the samples from the ensemble BNNs, improving parametric efficiency (see the sketch after this list).
- Theoretical analysis shows that the proposed Bayesian uncertainty is equivalent to the LCB penalty under linear MDP assumptions, providing provably efficient pessimism.
- Extensive empirical results demonstrate that DRVF significantly outperforms baseline methods in terms of performance and parametric efficiency on the D4RL benchmarks.
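The paper's exact repulsive term is not reproduced in this summary; the sketch below shows one common form of such a regularizer, an RBF-kernel penalty on pairwise similarity between posterior samples, where minimizing the penalty pushes the samples apart. The function name, the kernel choice, and the `bandwidth` parameter are assumptions for illustration.

```python
import torch

def repulsive_regularizer(q_samples: torch.Tensor, bandwidth: float = 1.0) -> torch.Tensor:
    """Diversity-encouraging penalty over posterior Q samples.

    q_samples: shape (num_samples, batch_size), predictions from the
    ensemble BNN. Returns the mean off-diagonal RBF kernel value;
    minimizing it increases pairwise distance between samples.
    """
    diffs = q_samples.unsqueeze(0) - q_samples.unsqueeze(1)  # (S, S, B) pairwise differences
    sq_dist = (diffs ** 2).mean(dim=-1)                      # (S, S) mean squared distance over batch
    kernel = torch.exp(-sq_dist / (2 * bandwidth ** 2))      # RBF similarity between samples
    s = q_samples.shape[0]
    off_diag = kernel * (1.0 - torch.eye(s))                 # drop self-similarity on the diagonal
    return off_diag.sum() / (s * (s - 1))                    # average pairwise similarity
```

Adding a term like this to the training loss discourages the ensemble samples from collapsing onto each other, which is what allows a small ensemble to represent meaningful posterior spread.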
Statistics
The paper does not highlight standalone numerical statistics for its key claims; results are reported as performance scores on various Gym MuJoCo tasks from the D4RL benchmark.