The paper addresses the overestimation problem in reinforcement learning (RL)-based dialogue policy learning for task-completion dialogue systems. The overestimation bias that arises when estimating the maximum action value can lead to inaccurate action values and suboptimal dialogue policies.
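One standard line of reasoning behind this bias (a textbook argument, not reproduced from the paper): even if every individual action value estimate is unbiased, the maximum over noisy estimates is biased upward, since the max is a convex function.

```latex
% With unbiased but noisy estimates \hat{Q}(s,a) = Q(s,a) + \epsilon_a,
% where E[\epsilon_a] = 0, Jensen's inequality for the convex max gives
\mathbb{E}\Big[\max_a \hat{Q}(s,a)\Big] \;\ge\; \max_a \mathbb{E}\big[\hat{Q}(s,a)\big] \;=\; \max_a Q(s,a),
% so the max-based estimate cannot undershoot, and with nonzero noise
% it typically overshoots, the true maximum in expectation.
```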
To address this issue, the paper proposes a dynamic partial average (DPAV) estimator. DPAV computes a partial average of the predicted maximum and minimum action values, with weights that are dynamically adaptive and problem-dependent. This shifts the estimate toward the ground-truth maximum action value and mitigates the overestimation bias.
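As a minimal sketch of the partial average itself (the function name is illustrative, and the paper's dynamic, problem-dependent weight-adaptation rule is not reproduced here; `beta` is treated as a given scalar):

```python
import numpy as np

def dpav_estimate(q_values: np.ndarray, beta: float) -> float:
    """Partial average of the predicted max and min action values.

    beta = 1 recovers the standard max estimator (prone to
    overestimation); beta = 0 recovers the min estimator (prone to
    underestimation). In the paper the weight is dynamically
    adaptive and problem-dependent; here it is taken as given.
    """
    return float(beta * q_values.max() + (1.0 - beta) * q_values.min())

# Example: noisy value estimates for three actions
q = np.array([1.2, 0.8, 1.0])
print(dpav_estimate(q, beta=0.7))  # 0.7 * 1.2 + 0.3 * 0.8 = 1.08
```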
The DPAV estimator is incorporated into a deep Q-network (DQN) as the dialogue policy module. Experiments on three dialogue datasets show that DPAV DQN achieves results better than or comparable to those of top baselines, at a lower computational cost. The paper also provides a theoretical analysis of the estimator's convergence and of the upper and lower bounds on its bias.
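For concreteness, here is a sketch of how such an estimator could replace the usual max in a one-step DQN target computation (standard DQN machinery is assumed; the function name, tensor shapes, and scalar `beta` are illustrative, not the paper's exact formulation):

```python
import torch

def dpav_dqn_target(reward: torch.Tensor,
                    done: torch.Tensor,
                    next_q: torch.Tensor,
                    beta: float,
                    gamma: float = 0.99) -> torch.Tensor:
    """One-step TD target with the max replaced by a DPAV-style
    partial average of the target network's max and min values.

    next_q: (batch, num_actions) action values produced by the
    target network for the next states.
    """
    q_max = next_q.max(dim=1).values
    q_min = next_q.min(dim=1).values
    partial_avg = beta * q_max + (1.0 - beta) * q_min
    # Standard DQN bootstrapping, zeroed out at terminal states
    return reward + gamma * (1.0 - done) * partial_avg
```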
The key highlights and insights are:
- Max-based action value estimation in RL suffers from overestimation bias, which degrades dialogue policies for task-completion systems.
- The proposed DPAV estimator mitigates this bias through a partial average of the predicted maximum and minimum action values, with dynamically adaptive, problem-dependent weights.
- Incorporated into a DQN dialogue policy, DPAV achieves better or comparable results against top baselines on three dialogue datasets, at a lower computational cost.
- The estimator is supported by a theoretical analysis of its convergence and of the upper and lower bounds on its bias.