The paper explores continuous-time and state-space optimal stopping (OS) problems from a reinforcement learning (RL) perspective. It begins by formulating the stopping problem using randomized stopping times, where the decision maker's control is represented by the probability of stopping within a given time. To encourage exploration and facilitate learning, the authors introduce a regularized version of the problem by penalizing the performance criterion with the cumulative residual entropy of the randomized stopping time.
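The entropy penalty can be sketched as follows; this is a minimal illustration assuming the standard (Rao-style) definition of cumulative residual entropy, with the weight $\lambda$ and sign convention chosen for exposition rather than taken from the paper:

$$
\mathcal{E}(\tau) \;=\; -\int_0^\infty p_t \ln p_t \, dt,
\qquad p_t := \mathbb{P}(\tau > t),
$$

so the regularized criterion adds $\lambda\,\mathcal{E}(\tau)$, $\lambda > 0$, to the expected stopping payoff. Since $-p\ln p \ge 0$ for $p \in [0,1]$, the penalty rewards stopping rules whose survival function $p_t$ decays gradually, i.e. randomized rather than bang-bang stopping.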
The regularized problem takes the form of a degenerate (n+1)-dimensional singular stochastic control problem with finite fuel. The authors address it through the dynamic programming principle, which enables them to identify the unique optimal exploratory strategy. For a specific real option problem, they derive a semi-explicit solution to the regularized problem, allowing them to assess the impact of entropy regularization and to analyze the vanishing-entropy limit.
Finally, the authors propose a reinforcement learning algorithm based on policy iteration, and they establish both a policy improvement and a policy convergence result for it.
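To make the policy iteration idea concrete, here is a minimal sketch for a generic discrete-state, discrete-time optimal stopping problem. This is not the authors' algorithm (which operates in continuous time with entropy regularization); it only illustrates the evaluate/improve loop for stopping policies. The transition matrix `P`, payoff `g`, and discount `gamma` are hypothetical inputs.

```python
import numpy as np

def policy_iteration_stopping(P, g, gamma=0.95, max_iter=100):
    """Policy iteration for a discrete optimal stopping problem.

    P     : (n, n) transition matrix of the underlying Markov chain.
    g     : (n,) payoff received upon stopping.
    gamma : discount factor in (0, 1).
    A policy is a boolean vector: True = stop in that state.
    """
    n = len(g)
    stop = np.ones(n, dtype=bool)  # initial policy: stop everywhere
    V = g.astype(float).copy()
    for _ in range(max_iter):
        # Policy evaluation: V = g on the stopping set; on the
        # continuation set V solves the linear system V = gamma * P @ V.
        V = g.astype(float).copy()
        cont = ~stop
        if cont.any():
            A = np.eye(cont.sum()) - gamma * P[np.ix_(cont, cont)]
            b = gamma * P[np.ix_(cont, stop)] @ g[stop]
            V[cont] = np.linalg.solve(A, b)
        # Policy improvement: stop wherever the immediate payoff
        # beats the discounted expected continuation value.
        new_stop = g >= gamma * (P @ V)
        if np.array_equal(new_stop, stop):
            break  # policy is stable: converged
        stop = new_stop
    return V, stop
```

Each iteration weakly improves the value function, mirroring the policy improvement result summarized above; the loop terminates once the stopping region stabilizes.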