Key concepts
Constrained normalizing flow policies enable interpretable and safe-by-construction reinforcement learning agents by exploiting domain knowledge to analytically construct invertible transformations that map actions into the allowed constraint regions.
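To illustrate the idea (this is a minimal sketch with an assumed tanh parameterization, not the paper's exact construction), the code below shows a single invertible "constraint layer" that squashes an unconstrained latent action into a box-shaped feasible region, together with the log-determinant term a normalizing flow needs so the squashed policy density stays normalized; the bounds `low`/`high` and the helper names are assumptions for the example.

```python
# Minimal sketch (assumed parameterization, not the paper's construction):
# an invertible "constraint layer" squashing an unconstrained latent action z
# into the box [low, high], plus the log|det Jacobian| term required by a
# normalizing flow to keep the resulting policy density normalized.
import numpy as np

def box_constraint_flow(z, low, high):
    """Map z in R^d onto an action inside the box [low, high] via tanh."""
    t = np.tanh(z)                                 # t in (-1, 1) elementwise
    action = low + 0.5 * (t + 1.0) * (high - low)  # rescale into the box
    # d(action_i)/d(z_i) = 0.5 * (high_i - low_i) * (1 - t_i^2)
    log_det_jac = np.sum(np.log(0.5 * (high - low) * (1.0 - t**2) + 1e-12))
    return action, log_det_jac

def box_constraint_flow_inverse(action, low, high):
    """Exact inverse, needed to evaluate the policy density of a given action."""
    t = 2.0 * (action - low) / (high - low) - 1.0
    return np.arctanh(np.clip(t, -1.0 + 1e-9, 1.0 - 1e-9))

# Example: a 2D action that must stay inside [-1, 1] x [-1, 1]
low, high = np.array([-1.0, -1.0]), np.array([1.0, 1.0])
a, log_det = box_constraint_flow(np.array([2.5, -0.3]), low, high)
print(a, log_det)  # the returned action lies inside the box by construction
```

Because the transformation is invertible, the density of any executed action can be evaluated exactly, which is what lets standard likelihood-based RL updates pass through such a constraint layer.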
Summary
The paper proposes a method for constructing constrained normalizing flow policies (CNFP) to address the issues of interpretability and safety in reinforcement learning (RL).
Key highlights:
- RL policies represented by black-box neural networks are typically non-interpretable and not well-suited for safety-critical domains.
- The authors exploit domain knowledge to analytically construct invertible transformations that map actions into the allowed constraint regions, ensuring constraint satisfaction.
- The normalizing flow corresponds to an interpretable sequence of transformations, each aligning the policy with a particular constraint (see the sketch after this list).
- Experiments on a 2D point navigation task show that the CNFP agent learns the task as quickly as an unconstrained agent while maintaining perfect constraint satisfaction throughout training.
- The interpretable nature of the CNFP allows for inspection and verification of the agent's behavior, unlike monolithic policies learned by baseline methods.
- The authors highlight the potential for future work on developing non-convex transformation functions to broaden the applicability of the approach.
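To make the "one flow step per constraint" structure concrete, here is a hedged sketch in which each invertible step is annotated with the constraint it enforces; the specific constraints (an actuator box and a half-space such as "x-component at most 0.5"), the step implementations, and the trace-based inspection are illustrative assumptions. Guaranteeing that all constraints hold jointly after composition is what the paper's analytical construction provides, and this sketch does not reproduce that.

```python
# Hedged sketch of "one interpretable flow step per constraint": each step is an
# invertible map labelled with the constraint it enforces, and the intermediate
# actions can be inspected step by step. The constraints used here are assumed
# for illustration, not taken from the paper's experiments.
import numpy as np

class ConstraintStep:
    """One invertible flow step, labelled with the constraint it enforces."""
    def __init__(self, name, forward):
        self.name = name
        self.forward = forward  # forward(a) -> (a_transformed, log_det_jacobian)

def box_step(low, high):
    """Squash the action into the box [low, high] with a tanh reparameterization."""
    def forward(z):
        t = np.tanh(z)
        out = low + 0.5 * (t + 1.0) * (high - low)
        log_det = np.sum(np.log(0.5 * (high - low) * (1.0 - t**2) + 1e-12))
        return out, log_det
    return ConstraintStep("keep action inside actuator box", forward)

def halfspace_step(normal, offset):
    """Softly enforce normal . a <= offset (e.g. 'do not step past the obstacle')."""
    n = np.asarray(normal, dtype=float)
    n = n / np.linalg.norm(n)
    def forward(a):
        s = float(np.dot(n, a))                         # component along the normal
        s_new = offset - np.logaddexp(0.0, offset - s)  # monotone map onto (-inf, offset)
        log_det = -np.logaddexp(0.0, s - offset)        # log sigmoid(offset - s)
        return a + (s_new - s) * n, log_det
    return ConstraintStep(f"keep {normal} . a <= {offset}", forward)

def run_flow(z, steps):
    """Apply the steps in order, accumulating log|det J| and an inspection trace."""
    total_log_det, trace = 0.0, []
    for step in steps:
        z, log_det = step.forward(z)
        total_log_det += log_det
        trace.append((step.name, z.copy()))
    return z, total_log_det, trace

steps = [
    box_step(low=np.array([-1.0, -1.0]), high=np.array([1.0, 1.0])),
    halfspace_step(normal=[1.0, 0.0], offset=0.5),  # e.g. an obstacle wall at x = 0.5
]
action, log_det, trace = run_flow(np.array([3.0, -0.2]), steps)
for name, intermediate in trace:
    print(name, intermediate)  # inspect how each constraint step reshaped the action
```

Printing the trace shows how each named step reshaped the action, which is the kind of step-by-step inspection the interpretability claim refers to.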
Example constraints
The agent's battery level should be kept above 20%.
Executing an action that would lead to a collision with an obstacle is not allowed.
Quotes
"Our method builds on recent normalizing flow policies [8, 9], where a normalizing flow model is employed to learn a complex, multi-modal policy distribution. We show that by exploiting domain knowledge one can analytically construct intermediate flow steps that correspond to particular (safety-) constraints."
"Importantly, solely rejecting actions that violate constraints does not suffice, since this would lead to biased gradient estimates [13, 15]."
"Theoretically, Mφ can then be optimized with regular RL algorithms that then satisfy the constraints, even during learning, by construction."