We develop algorithms to jointly learn a global reward function and agent-specific discount factors from demonstrations provided by experts with differing planning horizons.
The core message of this paper is that a Bayesian approach to model-based inverse reinforcement learning (BM-IRL) can yield robust policies by simultaneously estimating the expert's reward function and their internal model of the environment dynamics. This is achieved via a prior encoding the accuracy of the expert's dynamics model, which encourages the learner to plan against worst-case dynamics outside the offline data distribution.
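As a hedged sketch in our own notation (the demonstration set $\mathcal{D}$, reward $R$, estimated dynamics $\hat{T}$, and uncertainty set $\mathcal{T}_{\mathcal{D}}$ are illustrative placeholders, not the paper's definitions), the joint Bayesian estimate could take the form
\[
p(R, \hat{T} \mid \mathcal{D}) \;\propto\; p(\mathcal{D} \mid R, \hat{T})\, p(\hat{T})\, p(R),
\]
where the prior $p(\hat{T})$ encodes the assumed accuracy of the expert's dynamics model. Planning against worst-case dynamics outside the data distribution would then resemble a robust objective of the form
\[
\pi^{*} \;=\; \arg\max_{\pi} \; \min_{T \in \mathcal{T}_{\mathcal{D}}} \; \mathbb{E}_{\pi, T}\!\left[\sum_{t} \gamma^{t} R(s_t, a_t)\right],
\]
where $\mathcal{T}_{\mathcal{D}}$ denotes a set of dynamics models plausible under the offline data.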