Diverse Randomized Value Functions: A Provably Pessimistic Approach for Offline Reinforcement Learning
This paper proposes a novel strategy that uses diverse randomized value functions to estimate the posterior distribution of Q-values. The estimated posterior provides robust uncertainty quantification, from which lower confidence bounds (LCBs) of the Q-values are computed. Applying moderate value penalties to out-of-distribution (OOD) actions then makes the resulting method provably pessimistic.
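To make the idea of an LCB from randomized value functions concrete, the following is a minimal sketch, not the paper's implementation. It assumes the posterior is approximated by a small set of sampled Q-value vectors and that the LCB takes the common form of mean minus a multiple of the standard deviation; the function name `lcb_q_values`, the coefficient `beta`, and the toy numbers are all hypothetical.

```python
from statistics import fmean, pstdev


def lcb_q_values(q_samples, beta=1.0):
    """Return one lower confidence bound per action.

    q_samples[i][a] is the Q-value that sampled value function i assigns
    to action a (a hypothetical stand-in for posterior samples).
    beta scales the uncertainty penalty (an assumed hyperparameter).
    """
    n_actions = len(q_samples[0])
    lcb = []
    for a in range(n_actions):
        column = [member[a] for member in q_samples]
        # Penalize the mean estimate by the spread across samples.
        lcb.append(fmean(column) - beta * pstdev(column))
    return lcb


# Toy data: three sampled value functions, two actions.
# Action 0: all samples agree, as expected for an in-distribution action.
# Action 1: samples disagree strongly, as expected for an OOD action.
q_samples = [
    [1.0, 2.0],
    [1.0, 4.0],
    [1.0, 0.0],
]

lcb = lcb_q_values(q_samples, beta=1.0)
greedy_action = max(range(len(lcb)), key=lambda a: lcb[a])
```

In this toy case action 1 has the higher mean Q-value, but its large disagreement across samples pushes its LCB below that of action 0, so acting greedily on the LCB selects action 0. A larger `beta` penalizes uncertain actions more heavily; the "moderate" penalty described above corresponds to keeping this coefficient from becoming overly conservative.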