Core Concepts
This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark.
Abstract
This paper introduces a novel approach for model-based offline quantum reinforcement learning (QRL). The key aspects are:
- The model and the policy to be optimized are each implemented as variational quantum circuits (VQCs).
- The model is trained by gradient descent to fit a pre-recorded data set representing the environment dynamics.
- The policy is optimized with a gradient-free scheme, particle swarm optimization, using the return estimate given by the model as the fitness function.
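The two-stage scheme above can be sketched in a few lines. This is a minimal classical stand-in, not the paper's implementation: the paper's model and policy are VQCs, whereas here `surrogate_step` and `policy_action` are hypothetical numpy functions; the particle swarm optimizer follows the standard global-best formulation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Stand-in for the learned surrogate model: predicts the next state from
# (state, action). The paper fits a VQC to a pre-recorded data set by
# gradient descent; simple linear dynamics are hard-coded here instead.
def surrogate_step(state, action):
    A = np.array([[1.0, 0.1], [-0.1, 1.0]])
    b = np.array([0.0, 0.05])
    return A @ state + b * (2 * action - 1)  # action in {0, 1}

def reward(state):
    return 1.0 - min(abs(state[0]), 1.0)  # reward staying near the origin

# Policy stand-in: a linear threshold on the state (the paper uses a VQC).
def policy_action(params, state):
    return int(params @ state > 0.0)

# Return estimate from model rollouts, used as the PSO fitness function.
def estimated_return(params, horizon=50):
    state = np.array([0.1, 0.0])
    total = 0.0
    for _ in range(horizon):
        state = surrogate_step(state, policy_action(params, state))
        total += reward(state)
    return total

# Minimal global-best particle swarm optimization over policy parameters.
def pso(fitness, dim=2, n_particles=20, iters=40, w=0.7, c1=1.5, c2=1.5):
    pos = rng.normal(size=(n_particles, dim))
    vel = np.zeros_like(pos)
    pbest, pbest_f = pos.copy(), np.array([fitness(p) for p in pos])
    gbest = pbest[np.argmax(pbest_f)].copy()
    for _ in range(iters):
        r1, r2 = rng.random((2, n_particles, dim))
        vel = w * vel + c1 * r1 * (pbest - pos) + c2 * r2 * (gbest - pos)
        pos = pos + vel
        f = np.array([fitness(p) for p in pos])
        improved = f > pbest_f
        pbest[improved], pbest_f[improved] = pos[improved], f[improved]
        gbest = pbest[np.argmax(pbest_f)].copy()
    return gbest, pbest_f.max()

best_params, best_return = pso(estimated_return)
print(best_params, best_return)
```

No gradients of the policy are ever taken: PSO queries only the scalar fitness, which is what makes the scheme compatible with a return estimate produced by quantum circuit evaluations.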
This model-based approach allows, in principle, the entire optimization phase to run on a quantum computer, raising the prospect of a quantum advantage as quantum computing technology matures.
The authors demonstrate the functionality of their approach on the classical cart-pole control benchmark. The results show that the VQC surrogate model can accurately capture the environment dynamics, enabling the discovery of policies that can reliably balance the cart-pole system. The authors also investigate the impact of data re-uploading and data efficiency on the VQC surrogate model performance, comparing it to classical neural networks.
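Data re-uploading means the classical input is encoded into the circuit repeatedly, once per variational layer, rather than only at the start. A minimal single-qubit numpy sketch (illustrative gate choices, not the paper's exact circuit):

```python
import numpy as np

def ry(theta):
    """Single-qubit Y rotation."""
    c, s = np.cos(theta / 2), np.sin(theta / 2)
    return np.array([[c, -s], [s, c]], dtype=complex)

def rz(phi):
    """Single-qubit Z rotation."""
    return np.diag([np.exp(-1j * phi / 2), np.exp(1j * phi / 2)])

def vqc_output(x, thetas):
    """<Z> after alternating data-encoding and trainable rotations."""
    psi = np.array([1.0, 0.0], dtype=complex)  # start in |0>
    for theta in thetas:                        # one re-upload of x per layer
        psi = ry(theta) @ rz(x) @ psi
    z = np.array([[1, 0], [0, -1]], dtype=complex)
    return float(np.real(psi.conj() @ z @ psi))

print(vqc_output(0.3, np.array([0.5, 1.0, 0.2])))
```

Each additional layer re-encodes `x`, which lets the output expectation contain higher frequencies in `x` and thus increases the surrogate model's expressivity, the effect the authors investigate.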
The proposed method represents the first work on model-based offline QRL, which the authors argue is of particular importance due to the practical advantages of offline RL and the challenges of data transfer between classical and quantum components in online QRL.
Stats
The data set used for training the surrogate model consists of 10,000 observations from 442 episodes generated by a random policy on the cart-pole environment.
The average episode length is 22.6 steps.
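The reported average is consistent with the data set totals:

```python
# 10,000 observations spread over 442 episodes.
avg = 10_000 / 442
print(round(avg, 1))  # 22.6
```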
Quotes
"This paper presents the first algorithm for model-based offline quantum reinforcement learning and demonstrates its functionality on the cart-pole benchmark."
"The model and the policy to be optimized are each implemented as variational quantum circuits (VQCs)."
"The model is trained by gradient descent to fit a pre-recorded data set. The policy is optimized with a gradient-free optimization scheme using the return estimate given by the model as the fitness function."