Off-Policy Policy Gradient Method with Optimal Action-Dependent Baseline
The proposed Off-OAB method incorporates an optimal action-dependent baseline to reduce the variance of the off-policy policy gradient (OPPG) estimator, improving the sample efficiency of policy learning.
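The variance-reduction idea behind an action-dependent baseline can be illustrated on a toy problem. The sketch below is not the Off-OAB algorithm itself; it is a minimal, hypothetical example (softmax target policy over three discrete actions, uniform behavior policy, hand-picked `Q` values, and a simple baseline `b = 0.9 * Q`) showing that subtracting an action-dependent baseline from an importance-sampled policy gradient estimator, together with an exact correction term that restores unbiasedness, leaves the estimator's mean unchanged while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy setup (illustrative values, not from the paper):
# 3 discrete actions, softmax target policy pi_theta, uniform behavior policy mu.
theta = np.array([0.2, -0.1, 0.4])
pi = np.exp(theta) / np.exp(theta).sum()   # target policy probabilities
mu = np.full(3, 1.0 / 3.0)                 # behavior policy probabilities
Q = np.array([1.0, 3.0, -2.0])             # assumed action values
b = 0.9 * Q                                # an action-dependent baseline

def grad_log_pi(a):
    # Softmax score function: d/dtheta log pi(a) = e_a - pi
    g = -pi.copy()
    g[a] += 1.0
    return g

# Because b depends on the action, E_pi[grad log pi(a) * b(a)] is generally
# nonzero; adding it back exactly keeps the estimator unbiased.
correction = sum(pi[a] * grad_log_pi(a) * b[a] for a in range(3))

def sample_grad(use_baseline, n=20000):
    """Monte Carlo off-policy gradient estimate; returns (mean, total variance)."""
    acts = rng.choice(3, size=n, p=mu)
    ests = []
    for a in acts:
        rho = pi[a] / mu[a]                          # importance weight
        adv = Q[a] - (b[a] if use_baseline else 0.0)
        g = rho * grad_log_pi(a) * adv
        if use_baseline:
            g = g + correction
        ests.append(g)
    ests = np.array(ests)
    return ests.mean(axis=0), ests.var(axis=0).sum()

mean_nb, var_nb = sample_grad(use_baseline=False)
mean_b, var_b = sample_grad(use_baseline=True)
print("max mean difference:", np.abs(mean_nb - mean_b).max())
print("variance without baseline:", var_nb)
print("variance with baseline:   ", var_b)
```

Because the baseline tracks `Q` closely, the advantage term `Q[a] - b[a]` is small, so the variance with the baseline is far lower, while both estimators agree in expectation. Off-OAB's contribution is choosing the baseline optimally rather than by the ad hoc `0.9 * Q` used here.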