The paper focuses on softmax policy gradient (PG) methods for bandits and tabular Markov decision processes (MDPs). Recent theoretical research has analyzed PG methods in simplified settings, exploiting the objective's properties to guarantee global convergence to an optimal policy. However, the resulting algorithms require knowledge of unknown problem-dependent quantities, making them impractical.
The authors make the following contributions:
In the exact setting, they propose using an Armijo line-search to set the step-size for softmax PG, which enables adaptation to the objective's local smoothness and results in larger step-sizes and improved empirical performance. They also design an alternative line-search condition that takes advantage of the objective's non-uniform smoothness.
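To make the line-search idea concrete, here is a minimal Python sketch of exact softmax PG with Armijo backtracking on a multi-armed bandit with known mean rewards. The function names and the constants eta_max, beta, and c are illustrative assumptions, not the paper's exact choices; the alternative condition exploiting non-uniform smoothness is not shown.

```python
import numpy as np

def softmax(theta):
    # Numerically stable softmax policy over arms.
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def objective(theta, r):
    # Expected reward of the softmax policy (exact bandit setting).
    return softmax(theta) @ r

def grad(theta, r):
    # Exact policy gradient: pi_a * (r_a - pi^T r) for each arm a.
    pi = softmax(theta)
    return pi * (r - pi @ r)

def armijo_pg_step(theta, r, eta_max=1e4, beta=0.5, c=0.5):
    # Backtracking Armijo line-search for gradient ascent: shrink eta until
    # f(theta + eta * g) >= f(theta) + c * eta * ||g||^2, then take the step.
    g = grad(theta, r)
    f0 = objective(theta, r)
    eta = eta_max
    while objective(theta + eta * g, r) < f0 + c * eta * (g @ g):
        eta *= beta
    return theta + eta * g

# Example: iterate the step; the policy should concentrate on the best arm.
r = np.array([0.2, 0.5, 0.9])
theta = np.zeros(3)
for _ in range(200):
    theta = armijo_pg_step(theta, r)
print(softmax(theta))
```

Because the accepted step-size adapts to the smoothness around the current iterate, it can be much larger than a conservative step based on the global smoothness constant.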
In the stochastic setting, the authors use exponentially decreasing step-sizes and characterize the convergence rate of the resulting algorithm. The proposed algorithm matches the state-of-the-art convergence rates without requiring knowledge of any oracle-like information.
For the multi-armed bandit setting, the authors' techniques result in a theoretically principled PG algorithm that requires neither explicit exploration nor knowledge of the reward gap, the reward distributions, or the noise.
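As a rough illustration of the stochastic bandit setting discussed in the two paragraphs above, the sketch below runs stochastic softmax PG with an importance-weighted reward estimate and a step-size that decays exponentially with the iteration. The schedule eta_t = eta0 * gamma**t and all constants are assumptions for illustration; the paper's schedule and analysis fix these quantities differently.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(theta):
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def stochastic_softmax_pg(true_means, T=5000, eta0=2.0, gamma=0.999, noise_std=0.1):
    # Stochastic softmax PG for a K-armed bandit with Gaussian reward noise.
    K = len(true_means)
    theta = np.zeros(K)
    for t in range(T):
        pi = softmax(theta)
        a = rng.choice(K, p=pi)                   # sample an arm from the policy
        reward = true_means[a] + rng.normal(0.0, noise_std)
        r_hat = np.zeros(K)
        r_hat[a] = reward / pi[a]                 # importance-weighted reward estimate
        g_hat = pi * (r_hat - pi @ r_hat)         # plug-in gradient estimate
        theta = theta + eta0 * gamma**t * g_hat   # exponentially decreasing step-size
    return softmax(theta)

# Example: the learned policy should put most of its mass on the best arm,
# without explicit exploration or knowledge of the gap or noise level.
print(stochastic_softmax_pg(np.array([0.2, 0.5, 0.9])))
```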
In the appendix, the authors study the use of entropy regularization for PG methods in both the exact and stochastic settings. They introduce a practical multi-stage algorithm that iteratively reduces the entropy regularization and ensures convergence to the optimal policy without requiring knowledge of any problem-dependent constants.
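A minimal sketch of what such a multi-stage schedule could look like in the exact bandit setting is given below: run entropy-regularized softmax PG for a fixed number of iterations, shrink the entropy coefficient, and repeat. The stage length, shrink factor, step-size, and initial coefficient are hypothetical choices, not the paper's.

```python
import numpy as np

def softmax(theta):
    z = theta - theta.max()
    p = np.exp(z)
    return p / p.sum()

def entropy_reg_grad(theta, r, tau):
    # Gradient of the entropy-regularized objective
    #   f_tau(theta) = pi^T r + tau * H(pi), with H(pi) = -sum_a pi_a log pi_a,
    # which equals pi_a * (q_a - pi^T q) for q = r - tau * log(pi).
    pi = softmax(theta)
    q = r - tau * np.log(pi + 1e-12)
    return pi * (q - pi @ q)

def multi_stage_entropy_pg(r, tau0=1.0, shrink=0.5, stages=8, iters=200, eta=2.0):
    # Multi-stage schedule: optimize the regularized objective, then reduce tau.
    theta = np.zeros_like(r, dtype=float)
    tau = tau0
    for _ in range(stages):
        for _ in range(iters):
            theta = theta + eta * entropy_reg_grad(theta, r, tau)
        tau *= shrink   # reduce the entropy regularization between stages
    return softmax(theta)

# Example: as tau shrinks, the policy moves toward the unregularized optimum.
print(multi_stage_entropy_pg(np.array([0.2, 0.5, 0.9])))
```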
The proposed methods offer theoretical guarantees similar to the state-of-the-art results but do not require knowledge of oracle-like quantities, making them more practical.
Key insights distilled from the paper by Michael Lu, ... (arxiv.org, 10-01-2024): https://arxiv.org/pdf/2405.13136.pdf