Constant Regret Reinforcement Learning Algorithm for Misspecified Linear Markov Decision Processes
The authors introduce Cert-LSVI-UCB, a novel reinforcement learning algorithm that achieves a constant, instance-dependent, high-probability regret bound in misspecified linear Markov decision processes, without relying on any prior assumptions about the data distribution.