Core Concepts
This work proposes a method to learn non-Markovian safety constraints from labeled trajectory data and use them to enable safe reinforcement learning.
Abstract
The key highlights and insights from the paper are:
The authors address a general setting where safety labels (safe or unsafe) are associated with state-action trajectories, rather than just the immediate state-action pair. This allows modeling non-Markovian safety constraints.
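To make the trajectory-level labeling concrete, here is a minimal sketch of what such a labeled dataset might look like, assuming a toy "too many consecutive steps in a danger zone" constraint. The constraint, the `label_trajectory` function, and the dataset shapes are illustrative assumptions, not taken from the paper:

```python
import numpy as np

def label_trajectory(states, k=5, danger_lo=0.4, danger_hi=0.6):
    """Toy non-Markovian safety label: a trajectory is unsafe if the
    agent spends more than k *consecutive* steps inside a danger zone.
    No single (state, action) pair determines the label; it depends on
    the history, which is what makes the constraint non-Markovian."""
    consecutive = 0
    for s in states:
        if danger_lo <= s[0] <= danger_hi:  # first state coordinate in the zone
            consecutive += 1
            if consecutive > k:
                return 0.0  # unsafe
        else:
            consecutive = 0
    return 1.0  # safe

# Hypothetical labeled safety dataset: (trajectory, label) pairs.
rng = np.random.default_rng(0)
trajs = [rng.uniform(0, 1, size=(30, 2)) for _ in range(100)]  # 30 steps, 2-d states
dataset = [(traj, label_trajectory(traj)) for traj in trajs]
```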
They design a safety model that performs credit assignment, assessing how partial state-action trajectories contribute to the overall safety outcome. This safety model is trained on the labeled safety dataset.
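One plausible way to realize such a model is a recurrent classifier whose per-prefix outputs act as a soft credit assignment; a minimal sketch follows, assuming a GRU architecture and final-step supervision (the paper's exact safety model and credit-assignment scheme may differ):

```python
import torch
import torch.nn as nn

class TrajectorySafetyModel(nn.Module):
    """Sketch of a trajectory-level safety classifier.  A GRU summarizes
    the state-action prefix at every step, so the per-step logits can be
    read as how much each prefix contributes to the safety verdict."""
    def __init__(self, obs_dim, act_dim, hidden=64):
        super().__init__()
        self.rnn = nn.GRU(obs_dim + act_dim, hidden, batch_first=True)
        self.head = nn.Linear(hidden, 1)

    def forward(self, obs, act):
        x = torch.cat([obs, act], dim=-1)  # (batch, T, obs_dim + act_dim)
        h, _ = self.rnn(x)                 # one hidden state per prefix
        return self.head(h).squeeze(-1)    # per-step safety logits, (batch, T)

def safety_loss(model, obs, act, labels):
    """Binary cross-entropy on the final prefix against the trajectory
    label (a float tensor of 0/1); earlier steps inherit signal through
    the recurrence."""
    logits = model(obs, act)[:, -1]
    return nn.functional.binary_cross_entropy_with_logits(logits, labels)
```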
Using an RL-as-inference formulation, the authors derive an algorithm called SafeSAC-H that optimizes a safe policy against the learned safety model. SafeSAC-H extends the off-policy Soft Actor-Critic (SAC) method.
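In the RL-as-inference view, safety can enter the actor objective as a log-probability bonus alongside the reward critic and the entropy term. The sketch below shows that shape, reusing the `TrajectorySafetyModel` above; `policy.sample`, `actor_loss`, and all the argument names are hypothetical stand-ins, not SafeSAC-H's actual API:

```python
import torch

def actor_loss(policy, q_net, safety_model, obs, hist_obs, hist_act,
               alpha=0.2, beta=1.0):
    """SAC-style actor update augmented with a learned safety term.
    The policy is pushed toward actions with high reward Q-value and a
    high log-probability that the trajectory prefix remains safe."""
    act, log_pi = policy.sample(obs)        # reparameterized action sample
    q = q_net(obs, act)                     # reward critic estimate
    # Score the stored prefix extended by the candidate action.
    obs_seq = torch.cat([hist_obs, obs.unsqueeze(1)], dim=1)
    act_seq = torch.cat([hist_act, act.unsqueeze(1)], dim=1)
    log_safe = torch.nn.functional.logsigmoid(
        safety_model(obs_seq, act_seq)[:, -1])
    # Standard SAC entropy term plus a safety bonus weighted by beta.
    return (alpha * log_pi - q - beta * log_safe).mean()
```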
The authors devise a method to dynamically adapt the trade-off coefficient between reward maximization and safety compliance during training, removing the need to tune this coefficient by hand.
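A natural way to implement such adaptation is a dual-style update analogous to SAC's automatic entropy-temperature tuning: raise the coefficient when observed safety falls below a target level and lower it otherwise. The sketch below illustrates that idea; the update rule, the 0.95 target, and the names `log_beta`/`update_beta` are assumptions, not the paper's method:

```python
import torch

log_beta = torch.zeros(1, requires_grad=True)            # trade-off in log space
beta_opt = torch.optim.Adam([log_beta], lr=3e-4)
target_log_safe = torch.log(torch.tensor(0.95))          # desired safety level (assumed)

def update_beta(log_safe_batch):
    """Increase beta when the batch's safety log-probabilities fall
    below the target, decrease it when safety is comfortably met."""
    gap = (target_log_safe - log_safe_batch.detach()).mean()
    beta_loss = -(log_beta * gap)
    beta_opt.zero_grad()
    beta_loss.backward()
    beta_opt.step()
    return log_beta.exp().item()                         # current beta value
```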
The empirical results demonstrate that the proposed approach is highly scalable and able to satisfy sophisticated non-Markovian safety constraints across a variety of continuous control domains.
Stats
The paper reports no standalone numerical metrics in the summarized content; its quantitative results appear as reward and safety performance plots.