The paper focuses on the discounted discrete-time Linear Quadratic Regulator (LQR) problem, where the system parameters are unknown.
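For concreteness, a standard formulation of this problem class is the following (the notation here is illustrative and may differ from the paper's):

$$
x_{t+1} = A x_t + B u_t, \qquad u_t = -K x_t, \qquad
C(K) \;=\; \mathbb{E}_{x_0}\!\left[\, \sum_{t=0}^{\infty} \gamma^{t} \left( x_t^{\top} Q\, x_t + u_t^{\top} R\, u_t \right) \right],
$$

where $\gamma \in (0,1)$ is the discount factor, $Q \succeq 0$ and $R \succ 0$ are cost matrices, and the goal is to find a gain $K$ minimizing $C(K)$ when the system matrices $A$ and $B$ are unknown, so that the cost and its gradient can only be accessed through sampled trajectories.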
The key contributions are:

The authors propose a new gradient estimation scheme inspired by the REINFORCE method, which relies on appropriately sampling deterministic policies. This allows them to establish high-probability upper bounds on the gradient estimates using moment concentration inequalities.
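As a rough illustration of this style of estimator, the sketch below implements a generic one-point, Gaussian-smoothing gradient estimate over the gain matrix. It is not the paper's exact REINFORCE-inspired construction; the system matrices, rollout horizon T, and smoothing radius sigma are hypothetical placeholders.

```python
import numpy as np

def rollout_cost(A, B, Q, R, K, x0, gamma=0.99, T=200):
    """Discounted cost of the deterministic linear policy u = -K x,
    approximated by a single finite-horizon rollout."""
    x, cost = np.array(x0, dtype=float), 0.0
    for t in range(T):
        u = -K @ x
        cost += (gamma ** t) * (x @ Q @ x + u @ R @ u)
        x = A @ x + B @ u
    return cost

def one_point_gradient(A, B, Q, R, K, x0, sigma=0.05, **rollout_kwargs):
    """One-point Gaussian-smoothing gradient estimate: for a standard
    Gaussian perturbation W,  E_W[C(K + sigma*W) * W] / sigma  equals the
    gradient of the smoothed cost E_W[C(K + sigma*W)], so a single
    perturbed rollout gives an unbiased estimate of that surrogate gradient."""
    W = np.random.randn(*K.shape)                      # perturbation direction
    cost = rollout_cost(A, B, Q, R, K + sigma * W, x0, **rollout_kwargs)
    return (cost / sigma) * W

# Example usage on a hypothetical 2-state / 1-input system:
A = np.array([[1.0, 0.1], [0.0, 1.0]]); B = np.array([[0.0], [0.1]])
Q = np.eye(2); R = np.eye(1); K = np.zeros((1, 2)); x0 = np.array([1.0, 0.0])
g_hat = one_point_gradient(A, B, Q, R, K, x0)
```

Because only one perturbed rollout is needed per estimate, no paired evaluations of the cost are required.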
By adopting time-varying learning rates, the authors' methodology reaches ε-optimality at an O(1/ε) rate, circumventing the need for two-point gradient estimation, which is unrealistic in many settings because it requires evaluating two policies under the same realization of the system noise.
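Generically, the role of the decaying step sizes can be seen from the standard stochastic-descent recursion under the smoothness and gradient-dominance (PL) properties discussed in the next point (a sketch only; the paper's analysis must additionally control the bias and heavy tails of one-point estimates over the relevant sublevel set):

$$
\mathbb{E}\!\left[ C(K_{t+1}) - C(K^{*}) \right] \;\le\; \left( 1 - 2\mu\,\eta_t \right) \mathbb{E}\!\left[ C(K_t) - C(K^{*}) \right] \;+\; \frac{L\,\eta_t^{2}}{2}\,\sigma_g^{2},
$$

where $\mu$ is the PL constant, $L$ the smoothness constant, and $\sigma_g^{2}$ a second-moment bound on the gradient estimates; choosing $\eta_t = \Theta\!\left(1/(\mu t)\right)$ yields suboptimality $O(1/t)$, i.e., ε-optimality after $O(1/\varepsilon)$ gradient evaluations, with one rollout each.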
The authors provide a detailed analysis of the regularity properties of the LQR cost function, including local Lipschitz continuity, local smoothness, and the Polyak-Lojasiewicz (PL) condition. These properties are crucial for establishing the convergence guarantees.
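Stated generically over a sublevel set of stabilizing gains $\mathcal{S}_\alpha = \{K : C(K) \le \alpha\}$, these properties take roughly the following form (illustrative; the paper's constants depend on the problem data):

$$
|C(K') - C(K)| \le \ell(\alpha)\,\|K' - K\|_F, \qquad
\|\nabla C(K') - \nabla C(K)\|_F \le L(\alpha)\,\|K' - K\|_F,
$$
$$
\|\nabla C(K)\|_F^{2} \;\ge\; 2\mu \left( C(K) - C(K^{*}) \right),
$$

for all $K, K'$ in (a neighborhood within) $\mathcal{S}_\alpha$, with $K^{*}$ the optimal gain. The PL inequality is what allows a gradient method to converge globally despite the non-convexity of $C$.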
The authors show that their proposed algorithm achieves ε-optimality with a sample complexity of Õ(1/ε), substantially improving upon the previous best-known results, which either had a sample complexity of O(1/ε^2) or relied on additional stability assumptions.
The paper presents a significant advancement in the understanding and optimization of the discounted discrete-time LQR problem in the model-free setting, with potential applications in various control and reinforcement learning domains.
Key insights distilled from the source by Amirreza Nes... at arxiv.org, 04-18-2024: https://arxiv.org/pdf/2404.10851.pdf