Core Concepts
The optimistic Q-learning algorithm achieves sample-efficient reinforcement learning without explicitly modeling transition probabilities.
Key Statistics
We present an optimistic Q-learning method that achieves Õ(poly(H)·√T) regret under perfect knowledge of f, where T is the total number of interactions with the system.
Even when only a noisy estimate f̂ of f is available, our algorithm learns an approximately optimal policy using a number of samples independent of the sizes of the state and action spaces.
The regret of using asymptotically accurate online estimators with our algorithm is Õ(√(H⁶T) + L√(H·d·T)).
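The paper's algorithm is not given here, but the general idea of optimistic Q-learning — optimistic initialization plus a visit-count-based exploration bonus, with no estimated transition model — can be sketched in tabular form. This is a generic illustration in the style of UCB-bonus Q-learning, not the paper's exact method; the toy MDP, constants, and bonus form are all assumptions chosen for the example.

```python
import math
import random

# Illustrative toy episodic MDP (NOT from the paper): 2 states, 2 actions.
H = 3          # episode horizon
S, A = 2, 2    # state / action space sizes
K = 500        # number of episodes
c = 1.0        # bonus scaling constant (illustrative)

random.seed(0)

def step(s, a):
    """Toy dynamics: action 1 tends to move to state 1, which pays reward 1."""
    s2 = 1 if (a == 1 and random.random() < 0.9) else 0
    r = 1.0 if s2 == 1 else 0.0
    return s2, r

# Optimistic initialization (Q = H) and per-step visit counts.
Q = [[[float(H)] * A for _ in range(S)] for _ in range(H)]
N = [[[0] * A for _ in range(S)] for _ in range(H)]

total_reward = 0.0
for k in range(K):
    s = 0
    for h in range(H):
        a = max(range(A), key=lambda x: Q[h][s][x])  # greedy w.r.t. optimistic Q
        s2, r = step(s, a)
        total_reward += r
        N[h][s][a] += 1
        t = N[h][s][a]
        alpha = (H + 1) / (H + t)                    # step-size schedule used in UCB-style analyses
        bonus = c * math.sqrt(H ** 3 * math.log(S * A * H * K) / t)  # optimism bonus ~ 1/sqrt(visits)
        v_next = max(Q[h + 1][s2]) if h + 1 < H else 0.0
        v_next = min(v_next, H)                      # clip the value estimate at H
        Q[h][s][a] = (1 - alpha) * Q[h][s][a] + alpha * (r + v_next + bonus)
        s = s2

print(f"average per-episode reward: {total_reward / K:.2f}")
```

Note that the update rule never estimates transition probabilities: exploration is driven purely by the decaying bonus, which is the sense in which such methods are model-free yet sample-efficient.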
Quotes
"Our algorithm achieves sample efficiency in RL without explicitly modeling transition probabilities."