The authors study the problem of achieving constant regret guarantees in reinforcement learning (RL) with linear function approximation. They introduce an algorithm called Cert-LSVI-UCB that can handle misspecified linear Markov decision processes (MDPs), where both the transition kernel and the reward function can only be approximated by linear functions up to a misspecification level ζ.
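For concreteness, the misspecified linear MDP assumption is typically formalized as follows (this is the standard form of the assumption; the paper's exact statement and constants may differ slightly): there is a known feature map φ(s, a) ∈ R^d such that, for every stage h,

```latex
% Standard misspecified linear MDP assumption (sketch):
% rewards and transitions are zeta-close to linear in phi.
\begin{aligned}
  \bigl| r_h(s,a) - \langle \phi(s,a),\, \theta_h \rangle \bigr|
      &\le \zeta
      && \text{(reward misspecification)} \\
  \bigl\| \mathbb{P}_h(\cdot \mid s,a) - \langle \phi(s,a),\, \mu_h(\cdot) \rangle \bigr\|_{\mathrm{TV}}
      &\le \zeta
      && \text{(transition misspecification)}
\end{aligned}
```

Here θ_h and μ_h are unknown linear parameters, and ζ is the misspecification level referenced throughout.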
The key innovations in Cert-LSVI-UCB are:
A novel certified estimator that enables a fine-grained concentration analysis for multi-phase value-targeted regression, allowing the algorithm to establish an instance-dependent regret bound that is constant with respect to the number of episodes (a toy sketch of this idea appears after this list).
A constant, instance-dependent, high-probability regret bound of Õ(d^3 H^5 / Δ), where d is the feature dimension, H is the horizon length, and Δ is the minimal suboptimality gap (Õ(·) hides polylogarithmic factors, none of which involve the number of episodes), provided that the misspecification level ζ is below Õ(Δ / (√d H^2)).
The constant regret bound does not rely on any prior assumptions on the data distribution, in contrast to previous works that required assumptions such as the "UniSOFT" condition.
The authors also relate their result to the logarithmic lower bound on expected regret, arguing that a constant high-probability bound does not contradict it (the two notions of regret differ) and that the 1/Δ dependence on the suboptimality gap is optimal.
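The paper's certified estimator is an analytical construction rather than published code; the following is a minimal illustrative sketch in Python of the flavor of the idea, assuming value-targeted ridge regression with elliptical confidence widths. The function names (`ridge_fit`, `certified_predict`) and constants (`lam`, `beta`) are hypothetical choices for this sketch, not the paper's.

```python
import numpy as np

# Toy sketch (not the authors' exact Cert-LSVI-UCB): fit ridge regression
# on value-regression targets, and "certify" a prediction only when
# estimators trained on independent phases of data agree within their
# combined confidence widths.

def ridge_fit(Phi, y, lam=1.0):
    """Ridge regression: theta = (lam*I + Phi^T Phi)^{-1} Phi^T y."""
    d = Phi.shape[1]
    Sigma = lam * np.eye(d) + Phi.T @ Phi       # regularized covariance
    theta = np.linalg.solve(Sigma, Phi.T @ y)   # point estimate
    return theta, Sigma

def ucb_width(phi, Sigma, beta=1.0):
    """Elliptical confidence width: beta * ||phi||_{Sigma^{-1}}."""
    return beta * np.sqrt(phi @ np.linalg.solve(Sigma, phi))

def certified_predict(phi, phases, beta=1.0):
    """Certify the prediction <phi, theta> only if all phase-wise
    estimates agree within their combined confidence widths."""
    preds, widths = [], []
    for Phi, y in phases:
        theta, Sigma = ridge_fit(Phi, y)
        preds.append(phi @ theta)
        widths.append(ucb_width(phi, Sigma, beta))
    certified = all(
        abs(preds[i] - preds[j]) <= widths[i] + widths[j]
        for i in range(len(preds)) for j in range(i + 1, len(preds))
    )
    return preds[-1], certified

# Usage: two phases of synthetic data from one underlying linear model.
rng = np.random.default_rng(0)
d, n = 4, 200
theta_star = rng.normal(size=d)
phases = []
for _ in range(2):
    Phi = rng.normal(size=(n, d))
    y = Phi @ theta_star + 0.1 * rng.normal(size=n)  # noisy value targets
    phases.append((Phi, y))
phi_query = rng.normal(size=d)
est, ok = certified_predict(phi_query, phases)
print(f"estimate={est:.3f}, certified={ok}")
```

The certification step is what lets an analysis discard episodes where the regression could be misled by misspecification; in Cert-LSVI-UCB this is done through a far more careful multi-phase concentration argument than this toy agreement test.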
Key insights distilled from a paper by Weitong Zhan... on arxiv.org, 04-17-2024: https://arxiv.org/pdf/2404.10745.pdf