Key Concepts
This paper introduces UNIQ, an algorithm built on inverse Q-learning that trains agents to avoid undesirable behaviors by learning from both undesirable and unlabeled demonstrations.
Statistics
UNIQ consistently achieves the lowest cost across all experiments on Safety-Gym tasks, indicating its effectiveness in avoiding undesirable behaviors.
In the Point-Button and Car-Button tasks, UNIQ achieves a lower return than some baselines, suggesting a trade-off between maximizing return and avoiding undesirable actions.
Increasing the size of the undesirable dataset generally leads to a reduction in cost for all approaches, with UNIQ demonstrating the most significant improvement.
UNIQ achieves a lower cost than BC-safe (Behavioral Cloning with only desired demonstrations) in the ablation study, highlighting its ability to effectively leverage undesirable demonstrations for learning safer policies.
Quotes
"In this paper, we develop a principled framework for learning from undesirable demonstrations, based on the well-known MaxEnt RL framework (Ziebart et al., 2008) and inverse Q-learning (Garg et al., 2021) — a state-of-the-art imitation learning method."
"Our algorithm can be seen as a reverse version of the standard Inverse Q-learning algorithm (Garg et al., 2021), where the goal is not to imitate but rather to avoid undesired behaviors."
"Our method consistently achieves the lowest cost across all experiments. However, in the Point-Button and Car-Button tasks, the return for our method is lower, as it avoids undesired actions, leaving no high-return options to pursue."
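The quotes above place the method in the MaxEnt RL framework, where a Q-function induces a soft (log-sum-exp) state value and a Boltzmann policy. The sketch below illustrates only that standard relationship for a single state with discrete actions. It is not the UNIQ objective, which this summary does not state. The function names, the `temperature` parameter, and the hand-applied `penalty` on one action are assumptions made for illustration.

```python
import math


def soft_value(q_values, temperature=1.0):
    """Soft state value: V(s) = t * log(sum_a exp(Q(s, a) / t)).

    The maximum is subtracted first for numerical stability.
    """
    m = max(q_values)
    total = sum(math.exp((q - m) / temperature) for q in q_values)
    return m + temperature * math.log(total)


def maxent_policy(q_values, temperature=1.0):
    """Boltzmann policy: pi(a | s) = exp((Q(s, a) - V(s)) / t)."""
    v = soft_value(q_values, temperature)
    return [math.exp((q - v) / temperature) for q in q_values]


# Toy example (hypothetical numbers): three actions with equal Q-values.
q_before = [1.0, 1.0, 1.0]

# Assumption for illustration only: suppose learning lowered the Q-value
# of action 2 because it appears in undesirable demonstrations.
penalty = 3.0
q_after = [1.0, 1.0, 1.0 - penalty]

pi_before = maxent_policy(q_before)
pi_after = maxent_policy(q_after)

print("policy before:", pi_before)
print("policy after: ", pi_after)
```

In this toy setting, lowering the Q-value of an action shrinks its probability under the induced policy and shifts mass to the remaining actions, which is the qualitative effect an avoidance-oriented method aims for.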