Key Concepts
Local simulator access enables sample-efficient reinforcement learning for MDPs with low coverability, including challenging settings like Exogenous Block MDPs, using only realizability of the optimal state-action value function (Q*).
Summary
This work examines the power of local simulator access in reinforcement learning (RL) and presents new algorithms and guarantees for online RL with general function approximation.
Key highlights:
- The authors introduce the SimGolf algorithm, which leverages local simulator access to achieve sample-efficient learning for MDPs with low coverability, requiring only realizability of the optimal state-action value function. This significantly relaxes the representation assumptions required by prior algorithms.
- As a consequence, SimGolf is shown to make the notoriously challenging Exogenous Block MDP (ExBMDP) problem tractable in its most general form under local simulator access.
- To address the computational inefficiency of SimGolf, the authors present a more practical algorithm called RVFS (Recursive Value Function Search), which achieves sample-efficient learning guarantees with general value function approximation under a strengthened statistical assumption called pushforward coverability.
- RVFS explores by building core-sets with a novel value function-guided scheme, and can be viewed as a principled counterpart to successful empirical approaches like MCTS and AlphaZero that combine recursive search with value function approximation.
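To make the interaction protocol concrete, here is a minimal sketch of what "local simulator access" could look like in code, assuming the usual reading of the term: the learner may query transitions only from states it has already observed. The class `LocalSimulator`, the function `chain_transition`, and the 5-state chain are hypothetical illustrations, not part of the authors' algorithms.

```python
import random


class LocalSimulator:
    """Toy local simulator: transitions may be sampled only from observed states."""

    def __init__(self, transition, initial_state, seed=0):
        # `transition` is a hypothetical callable: (state, action, rng) -> (reward, next_state)
        self._transition = transition
        self._rng = random.Random(seed)
        self._visited = {initial_state}

    def sample(self, state, action):
        """Reset to `state` (if previously observed) and draw one transition."""
        if state not in self._visited:
            raise ValueError(f"state {state!r} has not been observed yet")
        reward, next_state = self._transition(state, action, self._rng)
        self._visited.add(next_state)  # newly observed states become valid reset points
        return reward, next_state


def chain_transition(state, action, rng):
    """Hypothetical 5-state chain: action 1 moves right with probability 0.9."""
    if action == 1 and rng.random() < 0.9:
        next_state = min(state + 1, 4)
    else:
        next_state = state
    reward = 1.0 if next_state == 4 else 0.0
    return reward, next_state


sim = LocalSimulator(chain_transition, initial_state=0, seed=0)
reward, next_state = sim.sample(0, 1)   # allowed: state 0 is the initial state
reward, next_state = sim.sample(0, 1)   # allowed again: repeated queries from the same state
```

The point of the design is the check in `sample`: unlike a global generative model, the learner cannot query arbitrary states, only ones it has reached, yet it can revisit them as often as needed.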
The key technical ideas include:
- Using local simulator access to directly estimate Bellman backups, avoiding the double sampling problem.
- Leveraging coverability and realizability conditions to obtain sample complexity guarantees.
- Designing core-set construction schemes guided by value function approximation to enable computationally efficient exploration.
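The first idea above can be illustrated with a small numerical sketch. Without resets, one typically sees a single next state per visited state-action pair, so the naive squared Bellman residual is inflated by the variance of the backup target (the double sampling problem). With a simulator that can be re-queried at the same pair, two independent draws give an unbiased estimate. The toy target distribution and the function names below are assumptions made for illustration; this is not the estimator used by SimGolf or RVFS.

```python
import random


def backup_target(rng):
    """One draw of r + max_a' f(s', a') from a fixed (s, a).

    Hypothetical toy: the target is 0 or 1 with equal probability,
    so the true Bellman backup is 0.5.
    """
    return 1.0 if rng.random() < 0.5 else 0.0


def single_sample_estimate(f_sa, n, rng):
    """Naive estimate: mean of (f - y)^2. Biased upward by Var(y)."""
    return sum((f_sa - backup_target(rng)) ** 2 for _ in range(n)) / n


def double_sample_estimate(f_sa, n, rng):
    """Two independent draws from the same (s, a): mean of (f - y1)(f - y2)."""
    total = 0.0
    for _ in range(n):
        y1 = backup_target(rng)  # first query to the simulator at (s, a)
        y2 = backup_target(rng)  # second, independent query at the same (s, a)
        total += (f_sa - y1) * (f_sa - y2)
    return total / n


rng = random.Random(0)
f_sa = 0.5  # equals the true backup, so the squared Bellman error is 0
single = single_sample_estimate(f_sa, 100_000, rng)
double = double_sample_estimate(f_sa, 100_000, rng)
print(f"single-sample: {single:.4f}, double-sample: {double:.4f}")
```

Here `f_sa` has zero Bellman error, yet the single-sample estimate stays at the target variance of 0.25, while the double-sample estimate concentrates around 0.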
Overall, the work demonstrates how local simulator access can unlock new statistical and computational guarantees for reinforcement learning with general function approximation that were previously out of reach.
Statistics
The following sentences contain key metrics or figures:
The total sample complexity in the RLLS (RL with local simulator access) framework is bounded by Õ(H^5 C_cov^2 log(|Q|/δ) / ε^4).
The total sample complexity in the RLLS framework is bounded by Õ(H^5 S^3 A^3 log|Φ| / ε^4).
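To get a feel for how these bounds scale, the sketch below evaluates only their leading terms, ignoring the constants and polylogarithmic factors hidden by the Õ notation. The function names and the example parameter values are hypothetical; the outputs show relative scaling, not actual sample counts.

```python
import math


def coverability_leading_term(H, C_cov, Q_size, delta, eps):
    """Leading term H^5 * C_cov^2 * log(|Q|/delta) / eps^4 (constants omitted)."""
    return H**5 * C_cov**2 * math.log(Q_size / delta) / eps**4


def latent_leading_term(H, S, A, Phi_size, eps):
    """Leading term H^5 * S^3 * A^3 * log|Phi| / eps^4 (constants omitted)."""
    return H**5 * S**3 * A**3 * math.log(Phi_size) / eps**4


# Hypothetical parameter values, chosen only to illustrate scaling.
base = coverability_leading_term(H=10, C_cov=5, Q_size=1000, delta=0.05, eps=0.5)
finer = coverability_leading_term(H=10, C_cov=5, Q_size=1000, delta=0.05, eps=0.25)
print(finer / base)  # halving eps multiplies the leading term by 2^4
```

Both bounds share the H^5 and ε^-4 dependence, so doubling the horizon raises the leading term by a factor of 32 and halving the target accuracy ε raises it by a factor of 16.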