Mitigating Overestimation Bias in Reinforcement Learning-based Task-Completion Dialogue Systems
The paper proposes a dynamic partial average (DPAV) estimator that mitigates overestimation bias in reinforcement learning-based dialogue policy learning. By estimating action values more accurately, DPAV leads to better dialogue performance.
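To make the notion of overestimation bias concrete, the sketch below simulates it on a toy problem in which every action has a true value of 0, so the true maximum is also 0. Taking the maximum over noisy estimates, as standard Q-learning targets do, yields a positive value on average. For contrast, the sketch also computes a fixed-weight average of the largest and smallest estimates. That `partial_average` function, its `weight` parameter, and the `simulate` helper are illustrative assumptions, not the paper's DPAV definition, whose weighting is described as dynamic.

```python
import random
import statistics


def max_estimate(noisy_values):
    """Standard max-based target: picks the largest noisy estimate."""
    return max(noisy_values)


def partial_average(noisy_values, weight):
    """Hypothetical fixed-weight blend of the largest and smallest estimate.

    weight=1.0 recovers the plain max; smaller weights pull the target
    toward the minimum. This is an assumption for illustration only.
    """
    return weight * max(noisy_values) + (1.0 - weight) * min(noisy_values)


def simulate(num_actions=5, noise_std=1.0, trials=20000, weight=0.75, seed=0):
    """Return the mean max-based and mean partial-average estimates.

    Every action has true value 0, so any nonzero mean is pure bias.
    """
    rng = random.Random(seed)
    max_vals, pav_vals = [], []
    for _ in range(trials):
        # Noisy value estimates around a true value of 0 for each action.
        estimates = [rng.gauss(0.0, noise_std) for _ in range(num_actions)]
        max_vals.append(max_estimate(estimates))
        pav_vals.append(partial_average(estimates, weight))
    return statistics.fmean(max_vals), statistics.fmean(pav_vals)


if __name__ == "__main__":
    max_bias, pav_bias = simulate()
    print(f"mean max-based estimate:       {max_bias:.3f}")
    print(f"mean partial-average estimate: {pav_bias:.3f}")
```

In this toy setting the max-based estimate is biased upward even though no action is actually better than another, and blending in the minimum reduces that bias. How much weight to place on each term is exactly the kind of choice a dynamic scheme would have to make per problem.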