Core Concepts
This work introduces Maximum Mean Discrepancy Q-Learning (MMD-QL), a novel algorithm that uses MMD barycenters to propagate the uncertainty of value functions through Temporal Difference (TD) updates in reinforcement learning, leading to improved exploration and performance.
Abstract
The paper presents Maximum Mean Discrepancy Q-Learning (MMD-QL), a new reinforcement learning algorithm that improves upon Wasserstein Q-Learning (WQL) by using the Maximum Mean Discrepancy (MMD) barycenter to propagate the uncertainty of value functions.
Key highlights:
MMD-QL maintains Q-posteriors and V-posteriors to express the uncertainty in the value function estimates.
During Temporal Difference (TD) updates, MMD-QL modifies the classic TD update rule to account for epistemic uncertainty (from estimating the reward and transition kernel) and aleatoric uncertainty (from approximating the next-state value function); a particle-based sketch of such an update follows this list.
MMD-QL employs a variational update scheme based on MMD barycenters to approximate the posterior distributions; MMD is chosen because it provides a tighter similarity estimate between probability measures than the Wasserstein distance (an empirical MMD estimator appears in the sketch below).
The authors establish that MMD-QL is Probably Approximately Correct in MDPs (PAC-MDP) under the average loss metric, implying it is as efficient as WQL in the worst case.
Experiments on tabular environments show that MMD-QL outperforms or matches the performance of WQL.
The authors also introduce MMD Q-Network (MMD-QN), a deep variant of MMD-QL, and provide theoretical analysis of its convergence rates under function approximation (a network-level sketch follows below).
Empirical results on challenging Atari games demonstrate that MMD-QN performs well compared to benchmark deep RL algorithms, highlighting its effectiveness in handling large state-action spaces.
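
To make the barycentric TD update concrete, here is a minimal sketch of a particle-based step. It is not the authors' implementation: the Gaussian kernel, bandwidth, particle counts, and the resampling step are all assumptions. The barycenter step relies on the fact that MMD is the RKHS distance between kernel mean embeddings, so the squared-MMD barycenter of two distributions with weights (1 - alpha, alpha) is their mixture, which particle sets can approximate by weighted resampling.

```python
import numpy as np

def mmd_squared(x, y, bandwidth=1.0):
    """Biased empirical estimate of squared MMD between 1-D particle sets x, y:
    MMD^2(P, Q) = E[k(x, x')] + E[k(y, y')] - 2 E[k(x, y)],
    with a Gaussian kernel k(a, b) = exp(-(a - b)^2 / (2 * bandwidth^2))."""
    def gram(a, b):
        return np.exp(-(a[:, None] - b[None, :]) ** 2 / (2.0 * bandwidth ** 2))
    return gram(x, x).mean() + gram(y, y).mean() - 2.0 * gram(x, y).mean()

def mmd_barycenter_td_update(q_particles, reward, gamma, v_next_particles,
                             alpha, rng):
    """One illustrative TD step on a particle Q-posterior for a fixed (s, a).

    The TD target posterior r + gamma * V(s') is obtained by shifting and
    scaling the next-state V-posterior particles; the new Q-posterior is the
    (1 - alpha, alpha) MMD barycenter of the old posterior and the target,
    i.e. their mixture, approximated here by weighted resampling."""
    target_particles = reward + gamma * v_next_particles
    pool = np.concatenate([q_particles, target_particles])
    weights = np.concatenate([
        np.full(q_particles.size, (1.0 - alpha) / q_particles.size),
        np.full(target_particles.size, alpha / target_particles.size),
    ])
    idx = rng.choice(pool.size, size=q_particles.size, p=weights)
    return pool[idx]

# Example: one update of a 32-particle Q-posterior.
rng = np.random.default_rng(0)
q = rng.normal(0.0, 1.0, size=32)        # Q-posterior particles for (s, a)
v_next = rng.normal(0.5, 1.0, size=32)   # V-posterior particles for s'
q = mmd_barycenter_td_update(q, reward=1.0, gamma=0.99,
                             v_next_particles=v_next, alpha=0.1, rng=rng)
print(mmd_squared(q, v_next))            # how far the two posteriors remain apart
```

Resampling keeps the particle count fixed; the paper's variational scheme may instead project the barycenter onto a chosen posterior family, so treat this purely as an illustration of the update's shape.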
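For the deep variant, one plausible reading is that a network outputs a set of particles per action and is trained to minimize the squared MMD between its predictions and TD target particles built from a frozen target network. The sketch below follows that reading; the architecture, particle count, kernel, and greedy-by-posterior-mean action selection are assumptions rather than the paper's specification.

```python
import torch
import torch.nn as nn

N_PARTICLES = 32  # particles per action (assumed hyperparameter)

class MMDQNetwork(nn.Module):
    """Hypothetical MMD-QN head: maps an observation to a set of particles
    per action, approximating each action's Q-posterior."""
    def __init__(self, obs_dim, n_actions, hidden=128):
        super().__init__()
        self.n_actions = n_actions
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions * N_PARTICLES),
        )

    def forward(self, obs):  # -> (batch, n_actions, N_PARTICLES)
        return self.net(obs).view(-1, self.n_actions, N_PARTICLES)

def gaussian_mmd2(x, y, h=1.0):
    """Batched biased squared-MMD with a Gaussian kernel; x, y: (batch, particles)."""
    def gram(a, b):
        return torch.exp(-(a.unsqueeze(2) - b.unsqueeze(1)) ** 2 / (2 * h * h))
    return (gram(x, x).mean((1, 2)) + gram(y, y).mean((1, 2))
            - 2 * gram(x, y).mean((1, 2)))

def mmd_td_loss(online, target, obs, act, rew, next_obs, done, gamma=0.99):
    """Squared MMD between the particles predicted for the taken action and
    TD target particles built from a frozen target network."""
    batch = torch.arange(obs.shape[0])
    pred = online(obs)[batch, act]                  # (batch, N_PARTICLES)
    with torch.no_grad():
        next_p = target(next_obs)                   # (batch, actions, particles)
        greedy = next_p.mean(-1).argmax(-1)         # greedy w.r.t. posterior mean
        tgt = rew[:, None] + gamma * (1 - done[:, None]) * next_p[batch, greedy]
    return gaussian_mmd2(pred, tgt).mean()
```

Acting greedily on the posterior mean is only one choice; an optimistic rule, such as taking an upper quantile of the particles, would more directly exploit the maintained uncertainty for exploration.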
Stats
The content does not contain any explicit numerical data or statistics. It focuses on the theoretical and empirical analysis of the proposed algorithms.
Quotes
"Accounting for the uncertainty of value functions boosts exploration in Reinforcement Learning (RL)."
"MMD is chosen because it provides a tighter similarity estimate between probability measures than the Wasserstein distance."
"Empirical results on challenging Atari games demonstrate that MMD-QN performs impressively compared to WQL and other benchmark algorithms for deep RL."