The paper focuses on the discounted discrete-time Linear Quadratic Regulator (LQR) problem, where the system parameters are unknown. The key contributions are:
The authors propose a new gradient estimation scheme, inspired by the REINFORCE method, that relies on appropriately sampling deterministic policies. This allows them to establish high-probability bounds on the gradient estimates via moment concentration inequalities.
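A minimal sketch of this idea, using a generic one-point zeroth-order estimator rather than the paper's exact REINFORCE-inspired construction: the LQR cost of a randomly perturbed deterministic gain is evaluated once per sample and averaged into a gradient estimate. The function names, the finite-horizon truncation, and the smoothing radius `r` below are illustrative assumptions, not the paper's construction.

```python
import numpy as np

def rollout_cost(K, A, B, Q, R, gamma, x0, horizon=200):
    # Discounted LQR cost of the deterministic policy u_t = -K x_t,
    # truncated at a finite horizon (an illustrative approximation).
    x, cost = np.asarray(x0, dtype=float).copy(), 0.0
    for t in range(horizon):
        u = -K @ x
        cost += (gamma ** t) * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return cost

def one_point_gradient_estimate(K, cost_fn, r=0.05, num_samples=50, rng=None):
    # One-point gradient estimate at K (one cost evaluation per sample):
    # perturb the gain by a random direction U of Frobenius norm r, evaluate
    # the cost of the perturbed deterministic policy, and average
    # (d / r) * cost_fn(K + U) * U over samples.
    rng = np.random.default_rng() if rng is None else rng
    d = K.size
    grad = np.zeros_like(K, dtype=float)
    for _ in range(num_samples):
        U = rng.standard_normal(K.shape)
        U *= r / np.linalg.norm(U)  # uniform direction, scaled to radius r
        grad += (d / r) * cost_fn(K + U) * U
    return grad / num_samples
```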
By adopting time-varying learning rates, the methodology reaches an O(1/ε) convergence rate while circumventing the need for two-point gradient estimates, which are unrealistic in many settings because they require evaluating two different policies under the same realization of the system noise.
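A minimal sketch of the resulting update loop, reusing the hypothetical one-point estimator above; the decaying schedule η_t = η₀ / (t + 1) and the toy parameters in the usage comment are assumptions, not necessarily the paper's exact choices.

```python
def policy_gradient_lqr(K0, cost_fn, eta0=1e-3, num_iters=500, rng=None):
    # Model-free policy gradient on the LQR gain with a time-varying step size,
    # relying only on one-point gradient estimates (no two-point evaluations).
    rng = np.random.default_rng() if rng is None else rng
    K = np.asarray(K0, dtype=float).copy()
    for t in range(num_iters):
        g = one_point_gradient_estimate(K, cost_fn, rng=rng)
        K = K - (eta0 / (t + 1)) * g  # decaying learning rate eta_t = eta0 / (t + 1)
    return K

# Hypothetical usage on a scalar toy system:
# A, B, Q, R = np.array([[1.0]]), np.array([[1.0]]), np.eye(1), np.eye(1)
# cost_fn = lambda K: rollout_cost(K, A, B, Q, R, gamma=0.9, x0=np.ones(1))
# K_hat = policy_gradient_lqr(np.array([[0.5]]), cost_fn)
```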
The authors provide a detailed analysis of the regularity properties of the LQR cost function, including local Lipschitz continuity, local smoothness, and the Polyak-Łojasiewicz (PL) condition. These properties are crucial for establishing the convergence guarantees.
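In schematic form these properties read as follows; the constants λ_PL, ℓ_K, L_K and the restriction to a sublevel set of stabilizing gains are problem-dependent, and the paper's precise statements may differ.

```latex
% Schematic regularity properties of the LQR cost C(K) over a sublevel set
% of stabilizing gains (constants are problem-dependent):
\[
  C(K) - C(K^{\star}) \le \lambda_{\mathrm{PL}} \,\|\nabla C(K)\|_F^{2}
  \quad \text{(PL / gradient dominance)},
\]
\[
  |C(K') - C(K)| \le \ell_K \,\|K' - K\|_F
  \quad \text{(local Lipschitz continuity)},
\]
\[
  \|\nabla C(K') - \nabla C(K)\|_F \le L_K \,\|K' - K\|_F
  \quad \text{(local smoothness)},
\]
% for all K' sufficiently close to K.
```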
The authors show that their proposed algorithm achieves ε-optimality with a sample complexity of Õ(1/ε), substantially improving upon the previous best-known results, which either had a sample complexity of O(1/ε^2) or relied on additional stability assumptions.
The paper presents a significant advancement in the understanding and optimization of the discounted discrete-time LQR problem in the model-free setting, with potential applications in various control and reinforcement learning domains.