Core Concepts
This paper proposes a framework for mapping non-Markov reward functions into equivalent Markov representations by learning a specialized reward automaton called a Reward Machine (RM). The approach learns RMs without requiring access to a set of high-level propositional symbols, instead inferring hidden triggers directly from data.
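To make the object concrete, here is a minimal sketch of a reward machine as a finite-state machine whose transitions fire on inferred trigger predicates over (state, action, reward) observations. The class, the Trigger type, and the step behavior are illustrative assumptions, not the paper's implementation.

```python
from typing import Callable, Dict, Hashable, List, Tuple

# A hidden trigger is modeled here as a predicate over one (state, action, reward)
# observation; the paper infers such triggers from data rather than assuming
# access to a labeling function over high-level propositions.
Trigger = Callable[[Hashable, Hashable, float], bool]

class RewardMachine:
    """Illustrative reward-machine sketch (not the paper's implementation)."""

    def __init__(self, initial_state: str):
        self.initial_state = initial_state
        # rm_state -> list of (trigger, next_rm_state, emitted_reward)
        self.transitions: Dict[str, List[Tuple[Trigger, str, float]]] = {}

    def add_transition(self, u: str, trigger: Trigger, v: str, reward: float) -> None:
        self.transitions.setdefault(u, []).append((trigger, v, reward))

    def step(self, u: str, s: Hashable, a: Hashable, r: float) -> Tuple[str, float]:
        """Advance the RM on one environment observation (s, a, r)."""
        for trigger, v, reward in self.transitions.get(u, []):
            if trigger(s, a, r):
                return v, reward
        return u, 0.0  # no trigger fired: stay in the current RM state
```

For example, a two-state machine for a "visit one location, then another" task would use one trigger that fires on the first location to move from u0 to u1, and another that fires on the second location to emit the final reward.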
Abstract
The paper addresses a basic mismatch in reinforcement learning (RL): most RL algorithms assume a Markov reward function, yet in complex environments the reward can depend on the history of interaction rather than on the current state alone. To address this, the authors propose a framework for mapping non-Markov reward functions into equivalent Markov representations by learning a specialized reward automaton called a Reward Machine (RM).
Key highlights:
RMs are more expressive than Deterministic Finite-State Automata (DFAs) as they can represent complex reward behavior in a single automaton, rather than requiring multiple DFAs.
Unlike prior work, the approach does not require access to a set of high-level propositional symbols or a labeling function. Instead, it infers hidden triggers directly from data, and these triggers encode the non-Markov reward dynamics.
The mapping process is formulated as an Integer Linear Programming (ILP) problem, so powerful off-the-shelf discrete optimization solvers can be applied (a toy encoding is sketched below, after this list).
The authors prove that the Abstract Reward Markov Decision Process (ARMDP), the cross product of the RM and the observed state space, is a suitable representation for maximizing expected reward under non-Markov rewards (see the product-construction sketch below).
Experiments on the Officeworld and Breakfastworld domains demonstrate the effectiveness of the approach, particularly in learning RMs with interdependent reward signals.
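As a hedged illustration of the ILP formulation, the toy encoding below assigns observed history prefixes to at most K abstract RM states using an off-the-shelf solver. PuLP with its bundled CBC solver, the prefix data, and the simple pairwise-incompatibility constraints are all assumptions made for this sketch; the paper's actual constraint set over H = (S × A × R)∗ is richer.

```python
import pulp  # assumption: any ILP front end would do; the paper does not prescribe PuLP

# Toy data: history prefixes, plus pairs that cannot share an abstract RM state
# because they would imply different rewards for the same next (state, action).
prefixes = ["h0", "h1", "h2", "h3"]
incompatible = [("h0", "h2"), ("h1", "h3")]  # illustrative only
K = 3  # upper bound on the number of RM states

idx = {h: i for i, h in enumerate(prefixes)}
prob = pulp.LpProblem("rm_state_assignment", pulp.LpMinimize)

# x[i][q] = 1 iff prefix i is mapped to abstract state q; y[q] = 1 iff state q is used.
x = pulp.LpVariable.dicts("x", (range(len(prefixes)), range(K)), cat="Binary")
y = pulp.LpVariable.dicts("y", range(K), cat="Binary")

prob += pulp.lpSum(y[q] for q in range(K))  # objective: use as few RM states as possible

for i in range(len(prefixes)):
    prob += pulp.lpSum(x[i][q] for q in range(K)) == 1  # each prefix gets exactly one state
    for q in range(K):
        prob += x[i][q] <= y[q]                          # only map prefixes to used states

for (a, b) in incompatible:
    for q in range(K):
        prob += x[idx[a]][q] + x[idx[b]][q] <= 1          # reward-consistency constraint

prob.solve(pulp.PULP_CBC_CMD(msg=False))
assignment = {h: next(q for q in range(K) if pulp.value(x[idx[h]][q]) > 0.5) for h in prefixes}
print(assignment)
```

Minimizing the number of used states mirrors the usual preference for the smallest automaton consistent with the data; the incompatible pairs stand in for prefixes that would demand different rewards for the same next (state, action).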
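The ARMDP result can be pictured as a product construction: the Markov state is the pair (observed environment state, RM state). The sketch below is a minimal illustration assuming the RewardMachine sketch above and a deterministic env_step function; the names are hypothetical.

```python
from typing import Callable, Hashable, Tuple

def armdp_step(env_step: Callable[[Hashable, Hashable], Tuple[Hashable, float]],
               rm,  # any object with step(u, s, a, r) -> (u_next, emitted_reward),
                    # e.g. the RewardMachine sketch above
               product_state: Tuple[Hashable, str],
               action: Hashable) -> Tuple[Tuple[Hashable, str], float]:
    """One transition of the cross-product (ARMDP) state (env_state, rm_state)."""
    s, u = product_state
    s_next, r = env_step(s, action)       # raw dynamics and (possibly non-Markov) reward
    u_next, _ = rm.step(u, s, action, r)  # RM state tracks the hidden reward context
    # Conditioned on the product state (s, u) and the action, the reward is Markov,
    # which is the sense in which the ARMDP is a suitable planning representation.
    return (s_next, u_next), r
```

Standard dynamic programming or Q-learning over the product state can then maximize expected reward even though the raw reward signal is non-Markov in the observed state alone.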
Stats
The paper does not report specific numerical metrics for its key claims; results are presented qualitatively, through figures and descriptions of learning performance and of the properties of the learned representations.
Quotes
"RMs are particularly appealing as they offer succinct encoding of non-Markov reward behavior and task decomposition."
"By learning RMs we can model reward dependencies in a single automata by representing patterns in H = (S × A × R)∗."
"Importantly, we show how by leveraging H = (S × A × R)∗ we can expedite learning in these cases."