Li, Andrew C., Chen, Zizhao, Klassen, Toryn Q., Vaezipoor, Pashootan, Toro Icarte, Rodrigo, & McIlraith, Sheila A. (2024). Reward Machines for Deep RL in Noisy and Uncertain Environments. Advances in Neural Information Processing Systems, 37. arXiv:2406.00120v3 [cs.LG].
This paper addresses the challenge of using Reward Machines (RMs) for deep reinforcement learning (RL) in realistic environments where the agent's perception of task-relevant propositions is uncertain due to partial observability and noisy sensing. The authors aim to develop RL algorithms that can exploit the structured task representation an RM provides even when no ground-truth interpretation of the RM's propositional vocabulary (i.e., a labelling function) is available.
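The difficulty is easiest to see against the standard Reward Machine definition. As a brief refresher (this is the conventional formulation from the RM literature, in simplified form; the paper's notation may differ in details):

```latex
% Reward Machine (simplified form): a finite-state machine over a set
% of propositions P, where U is a finite set of RM states, u_0 the
% initial state, delta_u the state-transition function, and delta_r
% the reward function.
\mathcal{R} = \langle U, u_0, \delta_u, \delta_r \rangle,
\qquad
\delta_u : U \times 2^{\mathcal{P}} \to U,
\qquad
\delta_r : U \times 2^{\mathcal{P}} \to \mathbb{R}.
```

In the noisy setting studied here, the agent never observes the true mapping from environment states to truth assignments in $2^{\mathcal{P}}$, so the current RM state itself becomes a latent variable that must be inferred.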
The authors formalize learning in Noisy Reward Machine Environments, modelled as Partially Observable Markov Decision Processes (POMDPs), and propose a framework that decouples inference of the current RM state from decision-making. Within this framework, they introduce three inference modules: Naive, Independent Belief Updating (IBU), and Temporal Dependency Modelling (TDM). Each module relies on an abstraction model (e.g., a pre-trained neural network, a sensor, or a heuristic) that estimates task-relevant propositions from observations; the inference module then turns these estimates into a belief over RM states on which the policy can condition. The proposed methods are evaluated on a range of challenging environments: a toy Gold Mining Problem, two MiniGrid environments with image observations (Traffic Light and Kitchen), and a MuJoCo robotics environment (Color Matching).
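To make the decoupled pipeline concrete, here is a minimal Python sketch of a filtering-style belief update over RM states, in the spirit of IBU: the abstraction model supplies independent per-proposition probabilities, and the belief over RM states is propagated through the RM's transition function. The toy RM, the transition function delta_u, and all names below are hypothetical illustrations, not the authors' implementation; TDM additionally models temporal dependencies among the abstraction model's predictions, which this sketch omits.

```python
import itertools

import numpy as np

# Toy Reward Machine (hypothetical, for illustration only): three RM
# states and two propositions. delta_u maps (rm_state, true_props) to
# the next RM state.
PROPS = ("a", "b")
N_RM_STATES = 3

def delta_u(u, true_props):
    # Proposition "a" advances state 0 -> 1; "b" advances 1 -> 2.
    if u == 0 and "a" in true_props:
        return 1
    if u == 1 and "b" in true_props:
        return 2
    return u

def belief_update(belief, prop_probs):
    """One filtering step over RM states, in the spirit of IBU.

    belief: np.ndarray of shape (N_RM_STATES,), current P(rm_state).
    prop_probs: dict mapping each proposition to its estimated
        probability of being true given the current observation,
        treated as independent across propositions.
    """
    new_belief = np.zeros_like(belief)
    # Marginalize over all 2^|PROPS| truth assignments.
    for assignment in itertools.product((False, True), repeat=len(PROPS)):
        true_props = {p for p, t in zip(PROPS, assignment) if t}
        p_sigma = 1.0
        for p, t in zip(PROPS, assignment):
            p_sigma *= prop_probs[p] if t else 1.0 - prop_probs[p]
        for u in range(N_RM_STATES):
            new_belief[delta_u(u, true_props)] += belief[u] * p_sigma
    return new_belief

# Usage: start certain in RM state 0; the abstraction model reports
# strong evidence for "a" and weak evidence for "b".
b = np.array([1.0, 0.0, 0.0])
b = belief_update(b, {"a": 0.9, "b": 0.1})
print(b)  # [0.1, 0.9, 0.0] -- probability mass shifts toward state 1
```

Note that the exhaustive enumeration over all truth assignments is exponential in the number of propositions and is tolerable here only because the vocabulary is tiny; the sketch is meant to convey the update rule, not an efficient implementation.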
This research highlights the importance of considering uncertainty in propositional evaluations when applying formal languages like RMs to deep RL in realistic settings. The proposed TDM method provides a promising solution for leveraging task structure to improve learning efficiency and robustness in such environments.
This work significantly contributes to the field of deep RL by extending the applicability of RMs to more realistic and challenging scenarios. The proposed framework and algorithms have the potential to enable the use of formal task specifications in a wider range of real-world applications, including robotics, autonomous driving, and human-robot interaction.
The current work assumes access to ground-truth rewards during training, which may not always be feasible in practice. Future research could explore methods for learning in Noisy RM Environments with only sparse or delayed rewards. Additionally, investigating the use of more sophisticated abstraction models, such as large language models, could further enhance the performance and generalization capabilities of the proposed framework.