Continuous-time Reinforcement Learning with Risk-sensitive Objective via Quadratic Variation Penalty
The paper studies continuous-time reinforcement learning (RL) with a risk-sensitive objective of exponential form. It shows that the risk-sensitive RL problem can be transformed into an ordinary, risk-neutral RL problem augmented with a quadratic variation (QV) penalty on the value process, which captures the variability of the value-to-go along the trajectory. This characterization allows existing RL algorithms to be adapted to incorporate risk sensitivity simply by augmenting their learning objectives with the realized variance of the value process, as sketched below.
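To make the transformation concrete, here is a sketch of the underlying equivalence, in notation assumed for illustration (not necessarily the paper's): $\beta \neq 0$ is the risk-sensitivity parameter, $r$ the running reward, $h$ the terminal reward, and $V$ the value function. The exponential-form objective is
\[
J = \frac{1}{\beta}\,\log \mathbb{E}\!\left[\exp\!\left(\beta \int_0^T r(t, X_t, a_t)\,dt + \beta\, h(X_T)\right)\right].
\]
Dynamic programming requires that $\exp\{\beta M_t\}$, with $M_t = V(t, X_t) + \int_0^t r(s, X_s, a_s)\,ds$, be a martingale; by Itô's formula this holds (up to integrability conditions) if and only if
\[
V(t, X_t) + \int_0^t r(s, X_s, a_s)\,ds + \frac{\beta}{2}\,\langle V \rangle_t
\]
is itself a martingale, which is the ordinary risk-neutral martingale condition augmented by the QV penalty $\frac{\beta}{2}\langle V \rangle_t$.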
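As a minimal illustration of this adaptation, the following Python sketch (all function and variable names hypothetical) forms the per-step residual of the QV-penalized martingale condition along a sampled trajectory and returns a TD-style squared loss; setting $\beta = 0$ recovers the usual risk-neutral residual:

    import numpy as np

    def qv_penalized_loss(values, rewards, dt, beta):
        # values: value estimates V(t_k, X_{t_k}) along one trajectory, shape (K+1,)
        # rewards: running rewards r(t_k, X_{t_k}, a_{t_k}), shape (K,)
        # dt: time-step size; beta: risk-sensitivity parameter
        dV = np.diff(values)          # increments of the value process
        qv = dV ** 2                  # realized quadratic variation per step
        # residual of the condition that V_t + \int r ds + (beta/2) <V>_t is a martingale
        resid = dV + rewards * dt + 0.5 * beta * qv
        return np.mean(resid ** 2)    # martingale (TD-style) squared loss

The realized-variance term `qv` is the discrete analogue of $\langle V \rangle_t$, so the only change to an existing mean-squared-TD implementation is the single penalty term.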