Mondal, W. U., & Aggarwal, V. (2024). Sample-Efficient Constrained Reinforcement Learning with General Parameterization. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS 2024).
This paper addresses the challenge of efficiently solving Constrained Markov Decision Processes (CMDPs) with general parameterized policies, aiming to improve sample complexity and achieve near-optimal solutions.
The authors propose a novel algorithm called Primal-Dual Accelerated Natural Policy Gradient (PD-ANPG), which leverages momentum-based acceleration within a primal-dual framework. They analyze the algorithm's performance theoretically, deriving bounds on its sample complexity.
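The primal-dual structure mentioned above can be illustrated with a minimal Python sketch. It is not the authors' PD-ANPG: it uses a hypothetical single-state, two-action problem with exact gradients, and it omits both the sampling and the momentum-based acceleration. The names `primal_dual_step`, `eta_theta`, `eta_lam`, `lam_max`, the reward and utility vectors, and the constraint convention (expected utility `c` must be non-negative) are all assumptions made for illustration.

```python
import numpy as np

def softmax(theta):
    """Numerically stable softmax over action preferences."""
    z = theta - np.max(theta)
    e = np.exp(z)
    return e / e.sum()

def primal_dual_step(theta, lam, r, c, eta_theta=0.5, eta_lam=0.5, lam_max=10.0):
    """One exact primal-dual update on a toy single-state problem.

    Assumed convention: maximise pi @ r subject to pi @ c >= 0.
    """
    pi = softmax(theta)
    payoff = r + lam * c              # per-action payoff under the Lagrangian
    advantage = payoff - pi @ payoff  # for a softmax policy, an NPG-style step moves theta along the advantage
    theta_new = theta + eta_theta * advantage            # primal ascent
    constraint_value = pi @ c                            # negative means the constraint is violated
    lam_new = float(np.clip(lam - eta_lam * constraint_value, 0.0, lam_max))  # projected dual descent
    return theta_new, lam_new

# Hypothetical toy data: action 0 pays reward but violates the constraint.
r = np.array([1.0, 0.0])
c = np.array([-1.0, 1.0])
theta, lam = np.zeros(2), 0.0
for _ in range(200):
    theta, lam = primal_dual_step(theta, lam, r, c)
```

In this sketch the dual variable rises whenever the current policy violates the constraint, which shifts the primal step toward the safer action; the projection onto `[0, lam_max]` keeps the multiplier bounded.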
The PD-ANPG algorithm achieves a sample complexity of Õ((1 − γ)^{-7} ε^{-2}) for ensuring ε-optimality and ε-constraint violation in CMDPs with general parameterized policies. This result significantly improves upon the previous state-of-the-art sample complexity of Õ((1 − γ)^{-8} ε^{-4}) and matches the theoretical lower bound in terms of ε^{-1}.
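To get a feel for the size of this improvement, the two bounds can be compared numerically. The sketch below drops constants and logarithmic factors, so it compares leading-order terms only; the function name `sample_complexity_order` and the values of `gamma` and `eps` are hypothetical choices for illustration.

```python
import math

def sample_complexity_order(gamma, eps, horizon_power, eps_power):
    """Leading-order term (1 - gamma)^(-horizon_power) * eps^(-eps_power).

    Constants and logarithmic factors hidden by the O-tilde notation are dropped.
    """
    return (1.0 - gamma) ** (-horizon_power) * eps ** (-eps_power)

gamma, eps = 0.9, 0.1  # hypothetical values chosen only for illustration

new_order = sample_complexity_order(gamma, eps, 7, 2)  # PD-ANPG bound
old_order = sample_complexity_order(gamma, eps, 8, 4)  # previous bound

# The ratio equals (1 - gamma)^-1 * eps^-2, about 1000 for these values.
improvement = old_order / new_order
```

The ratio grows as ε shrinks or as γ approaches 1, so the gain is largest in exactly the high-accuracy, long-horizon regime.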
The paper demonstrates that incorporating acceleration techniques into primal-dual natural policy gradient methods improves sample efficiency for constrained reinforcement learning with general parameterized policies, reducing the dependence on ε from ε^{-4} to ε^{-2}. This brings the achievable upper bound in line with the theoretical lower bound in ε^{-1}.
This research makes a substantial contribution to the field of constrained reinforcement learning by providing a theoretically sound and practically efficient algorithm for handling general parameterized policies. This has significant implications for real-world applications of CMDPs, where the state space is often large or continuous, necessitating the use of function approximation and general parameterization.
The paper focuses on the discounted infinite-horizon setting for CMDPs. Exploring the applicability of PD-ANPG to other settings like finite-horizon or average-reward CMDPs could be a potential direction for future research. Additionally, investigating the empirical performance of PD-ANPG on complex benchmark problems would further strengthen its practical relevance.
Key insights distilled from
by Washim Uddin... at arxiv.org, 11-01-2024
https://arxiv.org/pdf/2405.10624.pdf