Core Concepts
GOMs enable rapid adaptation to new tasks by modeling the distribution of all possible long-term outcomes in a reward- and policy-agnostic manner, avoiding the compounding errors of one-step dynamics models.
Abstract
This article introduces Generalized Occupancy Models (GOMs), a novel approach to reinforcement learning. GOMs support adaptive decision-making by modeling the distribution of all possible long-term outcomes from a given state, independently of any particular reward function or policy. The key idea is to avoid the compounding errors that arise when traditional model-based RL algorithms roll a one-step model forward over long horizons. The article discusses the theoretical framework, a practical instantiation based on diffusion models, and an experimental evaluation on simulated robotics problems.
Introduction
Reinforcement learning agents must be generalists, capable of adapting quickly to tasks with varying reward functions.
Model-based RL algorithms struggle with compounding one-step prediction errors in long-horizon problems.
GOMs address this by modeling all possible outcomes in a reward- and policy-agnostic manner.
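To make the compounding-error problem concrete, recall the standard simulation-lemma-style bound from the model-based RL literature (general background, not a result from this article): if a learned one-step model \hat{M} has transition error at most \epsilon in total variation at every state, then for any policy \pi with rewards bounded by R_{\max},

    \[ \left| V^{\pi}_{\hat M}(s) - V^{\pi}_{M}(s) \right| \le \frac{\gamma\, \epsilon\, R_{\max}}{(1-\gamma)^2}, \]

so value estimates obtained by rolling out the model can degrade quadratically in the effective horizon 1/(1-\gamma). GOMs sidestep this by never rolling a model forward step by step.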
Related Work
GOMs are compared to multi-task RL methods and successor features.
Model-based RL and off-policy RL algorithms are discussed in contrast to GOMs.
Preliminaries
GOMs adopt an off-policy dynamic programming approach to model cumulative future outcomes.
The distribution of all possible outcomes is modeled in a policy-agnostic manner.
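Concretely, reconstructing the standard formulation from the descriptions quoted later (the notation here is an assumption, not copied from the article): the long-term outcome of a trajectory \tau = (s_0, s_1, \ldots) can be taken to be its cumulative discounted features,

    \[ \psi(\tau) = \sum_{t=0}^{\infty} \gamma^{t} \phi(s_t) = \phi(s_0) + \gamma\, \psi(\tau_{1:}), \]

which satisfies a Bellman-style recursion and can therefore be learned with off-policy dynamic programming. A GOM models the distribution p(\psi \mid s) of all outcomes reachable from s under the dataset's coverage; for any reward linear in the features, r(s) = w^\top \phi(s), the return of an outcome is simply w^\top \psi.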
Generalized Occupancy Models
GOMs learn cumulative discounted features and model the distribution of all outcomes achievable in the environment under the dataset's coverage.
The framework is instantiated with diffusion models for tractable training; a schematic training step is sketched below.
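A minimal sketch of such a training step, under the formulation above. The model API (sample, loss), the feature map phi, and the batch layout are assumptions for illustration, not the authors' code:

    import torch

    def gom_training_step(model, target_model, phi, batch, gamma=0.99):
        # model, target_model: conditional generative models of p(psi | s),
        # e.g. conditional diffusion models (hypothetical API).
        # phi: feature map from states to R^d; batch: offline transitions (s, s').
        s, s_next = batch["obs"], batch["next_obs"]

        with torch.no_grad():
            # Sample a possible long-term outcome at the next state and form the
            # distributional Bellman-style target psi = phi(s) + gamma * psi'.
            psi_next = target_model.sample(cond=s_next)
            psi_target = phi(s) + gamma * psi_next

        # Fit the generative model to the bootstrapped target; for a diffusion
        # model this is the usual denoising loss, conditioned on s.
        return model.loss(psi_target, cond=s)

As in other off-policy methods, a slowly updated target_model keeps the bootstrapped regression targets stable.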
Planning and Adaptation
GOMs synthesize optimal policies for new tasks by inferring task-specific weights and using guided diffusion for planning.
This lets GOMs adapt to arbitrary new reward functions without relearning a model of the environment; a sketch of the adaptation step follows.
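A minimal sketch of that adaptation step, under the linear-reward assumption above: task weights w are fit by ridge regression of observed rewards onto features, and actions are chosen by scoring sampled outcomes. The article describes guided diffusion for planning; the simpler sample-and-rank procedure below is a stand-in with the same intent, and act_decoder and the sampling API are hypothetical:

    import torch

    def infer_task_weights(phi, states, rewards, reg=1e-4):
        # Ridge regression fit of r(s) ~ w^T phi(s) from a few labeled samples.
        F = torch.stack([phi(s) for s in states])       # (n, d) feature matrix
        A = F.T @ F + reg * torch.eye(F.shape[1])
        return torch.linalg.solve(A, F.T @ rewards)     # task weights w, shape (d,)

    def plan_action(model, act_decoder, w, s, n_samples=64):
        # Sample candidate outcomes psi from the GOM, rank them by predicted
        # return w^T psi, and decode the action that realizes the best one.
        cond = s.unsqueeze(0).expand(n_samples, -1)
        psi = model.sample(cond=cond)                   # (n_samples, d)
        best = psi[(psi @ w).argmax()]
        return act_decoder(s, best)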
Theoretical Analyses
Error analysis of GOMs is conducted to connect estimation errors to policy suboptimality.
GOMs are compared with consistent model-based algorithms in deterministic MDPs.
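For orientation, one simple way such a connection can look (a generic illustration of the form, not the paper's exact statement): if the learned outcome distribution is within \epsilon of the true one in Wasserstein-1 distance and the reward is w^\top \phi(s), then planning against the learned distribution yields a policy \hat\pi with

    \[ V^{\star}(s) - V^{\hat\pi}(s) \le 2\, \lVert w \rVert\, \epsilon, \]

since w^\top \psi is \lVert w \rVert-Lipschitz in \psi. Because \psi already aggregates the entire horizon, the error enters once rather than compounding step by step.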
Experimental Evaluation
GOMs demonstrate superior transfer performance compared to MBRL, successor features, and goal-conditioned RL.
GOMs successfully solve tasks with arbitrary rewards and demonstrate trajectory stitching, recombining segments of suboptimal trajectories into optimal behavior.
Quotes
"GOMs avoid compounding error while retaining generality across arbitrary reward functions."
"GOMs model the distribution of all possible long-term outcomes from a given state under the coverage of a stationary dataset."