Core Concepts
The Bellman equation for the two-discount-factor surrogate reward used for LTL objectives may have multiple solutions when one of the discount factors is set to 1, leading to inaccurate policy evaluation. A sufficient condition to ensure the Bellman equation has a unique solution equal to the value function is that the solution for states in rejecting bottom strongly connected components (BSCCs) is set to 0.
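For orientation, here is a minimal sketch of the construction as it is commonly written in this line of work; the symbols B, γ, and γ_B (and the choice of the constant reward) are assumed notation for illustration, not necessarily the paper's own:

```latex
% Two-discount surrogate reward on the product MDP (assumed notation):
% B = set of accepting ("good") states, 0 < \gamma_B < 1, 0 < \gamma \le 1.
R(s) = \begin{cases} 1-\gamma_B & s \in B \\ 0 & s \notin B \end{cases}
\qquad
\Gamma(s) = \begin{cases} \gamma_B & s \in B \\ \gamma & s \notin B \end{cases}
\qquad
V(s) = R(s) + \Gamma(s) \sum_{s'} P(s' \mid s)\, V(s').
```

When γ = 1, the equation for a state s in a rejecting BSCC degenerates to V(s) = Σ_{s'} P(s' | s) V(s'), which any constant assignment on that BSCC satisfies; fixing those entries to 0 is what removes the ambiguity.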
Summary
The content discusses the uniqueness of the solution to the Bellman equation when using a two-discount-factor surrogate reward approach for planning problems with linear temporal logic (LTL) objectives on Markov decision processes (MDPs).
Key highlights:
- The two-discount-factor surrogate reward approach is commonly used to translate LTL objectives into a form suitable for reinforcement learning. It assigns a constant positive reward to "good" (accepting) states and zero reward to all other states, and applies a state-dependent discount: one discount factor in the "good" states and a second one everywhere else.
- Previous works have allowed setting one of the discount factors to 1, but this can cause the Bellman equation to have multiple solutions, which can mislead the reinforcement learning algorithm into converging to a suboptimal policy.
- The authors demonstrate this issue with a concrete example in which the Bellman equation has multiple solutions when one discount factor is set to 1 (a small numerical sketch in the same spirit appears after this list).
- The authors propose a sufficient condition to ensure the Bellman equation has a unique solution equal to the value function: the solution for states in rejecting bottom strongly connected components (BSCCs) must be set to 0.
- The authors prove this condition is sufficient by showing that, when one discount factor is 1, the states can be partitioned into those with discounting and those without, and a unique solution can be derived for each part.
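The following is a minimal numerical sketch of the non-uniqueness and of the fix on a hypothetical three-state chain under a fixed policy; the transition matrix, state roles, and discount values are invented for illustration and are not taken from the paper:

```python
import numpy as np

# Hypothetical 3-state Markov chain under a fixed policy (illustrative only):
#   state 0: accepting ("good") state, self-loop
#   state 1: rejecting-BSCC state, self-loop
#   state 2: transient start state, moves to state 0 or 1 with prob 1/2 each
P = np.array([
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.5, 0.5, 0.0],
])

gamma_B = 0.9   # discount applied in accepting states
gamma   = 1.0   # discount applied elsewhere (the problematic choice)

R     = np.array([1.0 - gamma_B, 0.0, 0.0])      # positive reward only in accepting states
Gamma = np.array([gamma_B, gamma, gamma])        # state-dependent discount

def is_fixed_point(V, tol=1e-12):
    """Check whether V solves V = R + Gamma * (P V) componentwise."""
    return np.allclose(V, R + Gamma * (P @ V), atol=tol)

# Two different vectors both satisfy the Bellman equation when gamma = 1:
V_true  = np.array([1.0, 0.0, 0.5])   # the intended value (satisfaction probability)
V_other = np.array([1.0, 1.0, 1.0])   # a spurious fixed point
print(is_fixed_point(V_true), is_fixed_point(V_other))   # True True

# Pinning the rejecting-BSCC state to 0 and solving for the rest restores uniqueness:
free = [0, 2]                                    # states whose values remain unknown
A = np.eye(2) - (Gamma[free, None] * P[np.ix_(free, free)])
b = R[free]                                      # pinned states contribute 0, so they drop out
V_free = np.linalg.solve(A, b)
print(V_free)                                    # [1.0, 0.5] -> unique, equals the value function
```

Pinning the rejecting-BSCC entry before solving is exactly the role of the sufficient condition: it turns the undiscounted rows of the Bellman equation into a well-posed linear system with a single solution.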
The content provides a thorough analysis of the uniqueness issue in the Bellman equation for the two-discount-factor surrogate reward approach. It proposes a condition under which the value function is the unique fixed point of the Bellman operator, a property that is crucial for the convergence of reinforcement learning algorithms.