Core Concepts
Constrained normalizing flow policies enable interpretable and safe-by-construction reinforcement learning agents by exploiting domain knowledge to analytically construct invertible transformations that map actions into the allowed constraint regions.
Abstract
The paper proposes a method for constructing constrained normalizing flow policies (CNFP) to address the issues of interpretability and safety in reinforcement learning (RL).
Key highlights:
RL policies represented by black-box neural networks are typically uninterpretable and therefore ill-suited to safety-critical domains.
The authors exploit domain knowledge to analytically construct invertible transformations that map actions into the allowed constraint regions, ensuring constraint satisfaction (see the sketch after this list).
The normalizing flow corresponds to an interpretable sequence of transformations, each aligning the policy with respect to a particular constraint.
Experiments on a 2D point navigation task show that the CNFP agent learns the task as quickly as an unconstrained agent while maintaining perfect constraint satisfaction throughout training.
The interpretable structure of the CNFP allows the agent's behavior to be inspected and verified, unlike the monolithic policies learned by baseline methods.
The authors highlight the potential for future work on developing non-convex transformation functions to broaden the applicability of the approach.
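As a concrete illustration of the mechanism (a minimal sketch with a hypothetical box constraint, not the paper's implementation), the final flow step below squashes an unconstrained Gaussian sample into the allowed action box, so every sampled action is feasible by construction and the exact log-density follows from the change-of-variables formula:

```python
import torch
from torch.distributions import Normal

# Hypothetical box constraint on a 2D action (bounds are illustrative).
low  = torch.tensor([-1.0, -1.0])
high = torch.tensor([ 1.0,  1.0])

def squash_into_box(z):
    """Invertible map from R^2 into (low, high); returns the action
    and log|det Jacobian| for the change-of-variables formula."""
    t = torch.tanh(z)
    a = low + 0.5 * (high - low) * (t + 1.0)
    # Diagonal Jacobian: da_i/dz_i = 0.5 * (high_i - low_i) * (1 - tanh(z_i)^2)
    log_det = (torch.log(0.5 * (high - low)) + torch.log1p(-t ** 2)).sum(-1)
    return a, log_det

base = Normal(torch.zeros(2), torch.ones(2))   # unconstrained base policy
z = base.rsample()                             # reparameterized sample
action, log_det = squash_into_box(z)           # always inside the box
log_prob = base.log_prob(z).sum(-1) - log_det  # exact log-density of the action
```

Because the transformation is a bijection onto the feasible set, every sample is feasible, including those drawn early in training.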
Example Constraints
The agent's battery level should be kept above 20%.
Executing an action that would lead to a collision with an obstacle is not allowed.
Quotes
"Our method builds on recent normalizing flow policies [8, 9], where a normalizing flow model is employed to learn a complex, multi-modal policy distribution. We show that by exploiting domain knowledge one can analytically construct intermediate flow steps that correspond to particular (safety-) constraints."
"Importantly, solely rejecting actions that violate constraints does not suffice, since this would lead to biased gradient estimates [13, 15]."
"Theoretically, Mφ can then be optimized with regular RL algorithms that then satisfy the constraints, even during learning, by construction."