Continuous-time Reinforcement Learning with Risk-sensitive Objective via Quadratic Variation Penalty
The paper studies continuous-time reinforcement learning (RL) with a risk-sensitive objective of exponential form. It shows that the risk-sensitive RL problem can be transformed into an ordinary, risk-neutral RL problem augmented with a quadratic variation (QV) penalty on the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms to be adapted to incorporate risk sensitivity simply by augmenting their learning objectives with the realized variance of the value process, as sketched below.
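To make the transformation concrete, here is a sketch of the underlying equivalence, in notation assumed for illustration (not necessarily the paper's): $\beta \neq 0$ is the risk-sensitivity parameter, $r$ the running reward, $h$ the terminal reward, and $V$ the value function. The exponential-form objective is
\[
J = \frac{1}{\beta}\,\log \mathbb{E}\!\left[\exp\!\left(\beta \int_0^T r(t, X_t, a_t)\,dt + \beta\, h(X_T)\right)\right].
\]
Dynamic programming requires that $\exp\{\beta M_t\}$, with $M_t = V(t, X_t) + \int_0^t r(s, X_s, a_s)\,ds$, be a martingale; by Itô's formula this holds (up to integrability conditions) if and only if
\[
V(t, X_t) + \int_0^t r(s, X_s, a_s)\,ds + \frac{\beta}{2}\,\langle V \rangle_t
\]
is itself a martingale, which is the ordinary risk-neutral martingale condition augmented by the QV penalty $\frac{\beta}{2}\langle V \rangle_t$.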
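As a minimal illustration of this adaptation, the following Python sketch (all function and variable names hypothetical) forms the per-step residual of the QV-penalized martingale condition along a sampled trajectory and returns a TD-style squared loss; setting $\beta = 0$ recovers the usual risk-neutral residual:

    import numpy as np

    def qv_penalized_loss(values, rewards, dt, beta):
        # values: value estimates V(t_k, X_{t_k}) along one trajectory, shape (K+1,)
        # rewards: running rewards r(t_k, X_{t_k}, a_{t_k}), shape (K,)
        # dt: time-step size; beta: risk-sensitivity parameter
        dV = np.diff(values)          # increments of the value process
        qv = dV ** 2                  # realized quadratic variation per step
        # residual of the condition that V_t + \int r ds + (beta/2) <V>_t is a martingale
        resid = dV + rewards * dt + 0.5 * beta * qv
        return np.mean(resid ** 2)    # martingale (TD-style) squared loss

The realized-variance term `qv` is the discrete analogue of $\langle V \rangle_t$, so the only change to an existing mean-squared-TD implementation is the single penalty term.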