Core Concepts
We develop algorithms to jointly learn a global reward function and agent-specific discount factors from expert demonstrations with different planning horizons.
Summary
The paper studies an inverse reinforcement learning (IRL) problem in which experts plan under a shared reward function but with different, unknown planning horizons. Without knowledge of the discount factors, the feasible set of reward functions is larger, making it harder for existing IRL approaches to identify the true reward.
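To make the enlarged feasible set concrete, here is a hedged sketch in the notation of Ng and Russell's LP-IRL formulation (written for the simplified case where the expert takes action a^* in every state; this notation is an illustration, not taken from the paper itself):

```latex
% Rewards consistent with one expert, for a fixed, known discount factor \gamma:
\mathcal{R}(\gamma) = \{\, r : (\mathbf{P}_{a^*} - \mathbf{P}_a)(\mathbf{I} - \gamma \mathbf{P}_{a^*})^{-1} r \succeq 0 \ \ \forall a \,\}

% When \gamma is unknown, every \gamma \in (0,1) can explain the same demonstrations,
% so the set of admissible rewards grows to the union \bigcup_{\gamma \in (0,1)} \mathcal{R}(\gamma).
```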
To address this challenge, the authors develop two algorithms:
- Multi-Planning Horizon LP-IRL (MPLP-IRL):
  - Extends the linear programming IRL (LP-IRL) approach to handle multiple planning horizons.
  - Avoids undesirable solutions by maximizing the minimal non-zero difference of Q-functions over states where the expert policies are distinguishable.
  - Performs a bi-level optimization to jointly learn the reward function and the discount factors (a toy sketch of such a bi-level loop follows this list).
- Multi-Planning Horizon MCE-IRL (MPMCE-IRL):
  - Extends the maximum causal entropy IRL (MCE-IRL) approach to handle multiple planning horizons.
  - Shows that strong duality does not hold between the MPMCE-IRL problem and its Lagrangian dual, making inference less tractable.
  - Proposes a bi-level optimization approach to jointly learn the reward function and the discount factors (see the second sketch below).
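A minimal, illustrative sketch of a bi-level scheme in the spirit of MPLP-IRL is shown below. It is not the authors' algorithm: the inner level is a plain LP (via scipy.optimize.linprog) that maximizes the smallest Q-gap between each expert's chosen action and any alternative action for a tabular, state-only reward; the outer level is a coarse grid search over per-expert discount factors; and the restriction to states where expert policies are distinguishable is omitted. All function names and the tabular-MDP setup are assumptions for illustration only.

```python
import itertools
import numpy as np
from scipy.optimize import linprog

def lp_irl_inner(P, expert_policies, gammas, r_max=1.0):
    """Inner LP: find a shared state-only reward maximizing the smallest Q-gap
    between each expert's chosen action and every alternative action, for
    fixed per-expert discount factors.  Variables are [r_1, ..., r_S, t]."""
    n_actions, n_states, _ = P.shape
    A_ub, b_ub = [], []
    for pi, gamma in zip(expert_policies, gammas):
        pi = np.asarray(pi)
        P_pi = P[pi, np.arange(n_states), :]                  # S x S transitions under pi
        occ = np.linalg.inv(np.eye(n_states) - gamma * P_pi)  # (I - gamma * P_pi)^{-1}
        for s in range(n_states):
            for a in range(n_actions):
                if a == pi[s]:
                    continue
                # Require gamma * (P[pi[s], s] - P[a, s]) @ occ @ r >= t,
                # i.e.  -coef @ r + t <= 0  in linprog's A_ub x <= b_ub form.
                coef = gamma * (P[pi[s], s, :] - P[a, s, :]) @ occ
                A_ub.append(np.append(-coef, 1.0))
                b_ub.append(0.0)
    c = np.zeros(n_states + 1)
    c[-1] = -1.0                                              # maximize t
    bounds = [(-r_max, r_max)] * n_states + [(None, None)]
    res = linprog(c, A_ub=np.array(A_ub), b_ub=np.array(b_ub), bounds=bounds)
    if not res.success:
        return None, -np.inf
    return res.x[:n_states], res.x[-1]                        # reward, achieved margin

def mplp_irl_grid(P, expert_policies, gamma_grid=(0.5, 0.7, 0.9, 0.99)):
    """Outer level: coarse grid search over per-expert discount factors,
    keeping the (reward, gammas) pair with the largest inner margin."""
    best = (None, None, -np.inf)
    for gammas in itertools.product(gamma_grid, repeat=len(expert_policies)):
        r, margin = lp_irl_inner(P, expert_policies, gammas)
        if margin > best[2]:
            best = (r, gammas, margin)
    return best
```

The grid search over discount factors stands in for whatever outer-level optimization the paper actually performs; its only purpose here is to show the shape of the bi-level loop.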
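For the MCE-IRL side, the following is a hedged sketch of the standard maximum-causal-entropy machinery (soft value iteration and visitation matching) that such an extension would build on. It shows only the inner reward update for fixed per-expert discount factors; an outer search over the discount factors would mirror the grid loop above. It is not the paper's MPMCE-IRL procedure, and all names are illustrative.

```python
import numpy as np
from scipy.special import logsumexp

def soft_value_iteration(P, r, gamma, n_iters=200):
    """Soft (max-causal-entropy) Bellman backups for a state-only reward.
    Returns the soft-optimal stochastic policy pi(a | s) as an (S, A) array."""
    _, n_states, _ = P.shape
    V = np.zeros(n_states)
    for _ in range(n_iters):
        Q = r[None, :] + gamma * (P @ V)      # (A, S) soft Q-values
        V = logsumexp(Q, axis=0)              # soft max over actions
    return np.exp(Q - V[None, :]).T

def discounted_visitation(P, pi, p0, gamma, horizon=200):
    """Discounted state-visitation frequencies under a stochastic policy."""
    _, n_states, _ = P.shape
    P_pi = np.einsum('sa,asn->sn', pi, P)     # (S, S) transitions under pi
    d, mu = np.zeros(n_states), p0.copy()
    for t in range(horizon):
        d += (gamma ** t) * mu
        mu = mu @ P_pi
    return d

def mpmce_reward_step(P, p0, r, expert_visits, gammas, lr=0.1):
    """One gradient-ascent step on the shared reward: the gradient sums, over
    experts, the gap between empirical and model visitation frequencies, each
    computed with that expert's own discount factor."""
    grad = np.zeros_like(r)
    for d_expert, gamma in zip(expert_visits, gammas):
        pi = soft_value_iteration(P, r, gamma)
        grad += d_expert - discounted_visitation(P, pi, p0, gamma)
    return r + lr * grad
```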
The authors provide theoretical analyses of the identifiability of the reward function and the discount factors, showing that with a sufficiently large number of experts both become identifiable.
Experiments on three domains demonstrate that the learned reward functions generalize well to similar tasks and that both algorithms converge quickly compared with an exhaustive grid search.
Statistics
The value function V^{r^*, γ^*}(s_0) of the expert policies π^* under the true reward r^* and discount factors γ^*.
The value function V^{r̂, γ̂}(s_0) of the reconstructed optimal policies π̂ under the learned reward function r̂ and discount factors γ̂.