Core Concepts
An optimistic Q-learning algorithm achieves sample efficiency in RL without explicitly modeling transition probabilities.
Abstract
The authors study the sample complexity of online Q-learning when partial knowledge of the system dynamics is available.
The focus is on systems that evolve according to an additive disturbance model, s_{h+1} = f(s_h, a_h) + w_h, where the nominal dynamics f may be known exactly or only through a noisy estimate ˆf.
The algorithm achieves regret bounds with no dependence on the sizes of the state and action spaces (one possible mechanism is sketched below).
The regret is compared with that of other Q-learning algorithms.
Empirical results show convergence to the optimal policy.
The discussion covers horizon dependence, value function assumptions, noise in the dynamics model, continuous state-action spaces, and future directions.
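The claim that regret does not depend on the state and action spaces suggests that knowledge of the nominal dynamics f lets disturbance samples be shared across all state-action pairs. The sketch below illustrates that idea; it is a minimal illustration under stated assumptions, not the authors' exact algorithm. The environment interface (env.reset, env.step, env.reward), integer-valued states, the bonus form, and the pooling scheme are all illustrative assumptions.

```python
# Minimal, illustrative sketch (not the paper's exact algorithm) of how known
# nominal dynamics f in s' = f(s, a) + w can be exploited: every observed
# disturbance w = s' - f(s, a) is pooled into one shared dataset, so the value
# estimate for any (s, a) reuses all samples collected so far -- one route to
# sample complexity that does not scale with |S| or |A|.
# Assumptions: integer-valued states (usable as dict keys), a known reward
# function env.reward(s, a), and a Hoeffding-style bonus; names are hypothetical.
import numpy as np

def optimistic_q_with_known_dynamics(env, f, actions, H, K, c=1.0):
    disturbances = []                       # pooled samples of w, shared by all (s, a)
    V = [dict() for _ in range(H + 1)]      # value tables, one per step h = 0..H

    def value(h, s):
        # Optimistic default: at most H - h reward remains after step h.
        return V[h].get(s, float(H - h))

    def q_estimate(h, s, a):
        # Monte-Carlo estimate of E[V_{h+1}(f(s, a) + w)] over pooled disturbances,
        # plus an optimism bonus that shrinks as samples accumulate.
        if not disturbances:
            return float(H - h)
        next_vals = [value(h + 1, f(s, a) + w) for w in disturbances]
        bonus = c * H * np.sqrt(np.log(K * H) / len(disturbances))
        return env.reward(s, a) + float(np.mean(next_vals)) + bonus

    for _ in range(K):                      # K episodes of horizon H
        s = env.reset()
        for h in range(H):
            a = max(actions, key=lambda a: q_estimate(h, s, a))  # optimistic action
            s_next = env.step(a)
            disturbances.append(s_next - f(s, a))                # record the disturbance
            V[h][s] = min(float(H - h), q_estimate(h, s, a))     # refresh value at (h, s)
            s = s_next
    return V
```

Because the disturbance pool is shared, every interaction improves the value estimate at every state-action pair, which is consistent with the summary's claim of sample complexity independent of S and A.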
Stats
We present an optimistic Q-learning method that achieves ˜O(poly(H)√T) regret under perfect knowledge of f, where T is the total number of interactions with the system.
Our algorithm can learn an approximately optimal policy with a number of samples independent of the sizes of the state and action spaces, even when only a noisy estimate ˆf of f is available.
The regret of using asymptotically accurate online estimators with our algorithm is ˜O(√(H^6 T) + L√(H d T)).
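For readability, the two regret bounds above can be typeset in standard notation as follows; H is the horizon, T the number of interactions, and L and d are quantities defined in the paper but not in this summary.

```latex
% The two regret bounds stated above, typeset in standard notation.
\[
  \mathrm{Regret}(T) \;=\; \tilde{\mathcal{O}}\big(\mathrm{poly}(H)\,\sqrt{T}\big)
  \qquad \text{(perfect knowledge of $f$)},
\]
\[
  \mathrm{Regret}(T) \;=\; \tilde{\mathcal{O}}\big(\sqrt{H^{6}\,T} + L\sqrt{H\,d\,T}\big)
  \qquad \text{(asymptotically accurate online estimators of $f$)}.
\]
```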
Quotes
"Our algorithm achieves sample efficiency in RL without explicitly modeling transition probabilities."