Core Concepts
When actions in reinforcement learning are executed after stochastic delays and the delay realizations are observed, it is sufficient to optimize over the set of Markov policies, which is exponentially smaller than the set of history-dependent policies.
Abstract
The paper introduces the framework of Stochastic Execution Delay Markov Decision Processes (SED-MDPs) to model environments where actions are executed with random delays. It establishes a key theoretical finding: when the delay realizations are observed, optimizing within the class of Markov policies is sufficient to achieve optimal performance, so there is no need to search the much larger class of history-dependent policies.
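To make the delayed-execution setting concrete, here is a minimal Python sketch of a buffer that holds chosen actions until their sampled delay elapses. It is an illustration under stated assumptions, not the paper's formal definition: the class name `DelayedActionBuffer`, the uniform delay distribution, the default action used before anything is due, and the rule that an older action never overrides a more recently chosen one are all choices made for this example.

```python
import random


class DelayedActionBuffer:
    """Toy model of stochastic execution delay (illustrative assumptions only)."""

    def __init__(self, max_delay, default_action=0, seed=None):
        self.max_delay = max_delay
        self.rng = random.Random(seed)
        self.t = 0                            # current time step
        self.pending = []                     # (due_time, chosen_time, action)
        self.current = (-1, default_action)   # (chosen_time, action) now in effect

    def submit(self, action, delay=None):
        """Register an action chosen at time t; return its (observed) delay."""
        if delay is None:
            # Assumption: delays are uniform on {0, ..., max_delay}.
            delay = self.rng.randint(0, self.max_delay)
        if not 0 <= delay <= self.max_delay:
            raise ValueError("delay out of range")
        self.pending.append((self.t + delay, self.t, action))
        return delay

    def step(self):
        """Return the action executed at time t, then advance the clock."""
        due = [(c, a) for d, c, a in self.pending if d <= self.t]
        self.pending = [p for p in self.pending if p[0] > self.t]
        if due:
            newest = max(due, key=lambda p: p[0])
            # Assumption: a stale action never replaces a more recent one.
            if newest[0] > self.current[0]:
                self.current = newest
        self.t += 1
        return self.current[1]
```

A typical loop would call `submit(action)` once per step, read the returned delay as the observed delay realization, and pass the result of `step()` to the underlying environment.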
Based on this insight, the authors devise Delayed EfficientZero (DEZ), a model-based algorithm that builds upon the EfficientZero framework. DEZ maintains separate queues to track past actions and their delays, using them to accurately predict future states and make decisions accordingly.
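The queue-based prediction described above can be sketched roughly as follows. This is a simplified illustration, not DEZ itself: the function `predict_state`, its arguments, and the toy additive dynamics in the usage note are hypothetical, and the scheduling rule (the most recently chosen action whose delay has elapsed is the one executed) is an assumption of this sketch. In DEZ, the role of `dynamics` would presumably be played by the learned EfficientZero model.

```python
from collections import deque


def predict_state(state, now, horizon, actions, delays, current_action, dynamics):
    """Roll a one-step model forward `horizon` steps, starting at time `now`.

    `actions` and `delays` are parallel queues: actions[i] was chosen at time
    now - len(actions) + i and was observed to have delay delays[i].
    `current_action` is executed while no queued action is due yet; it is
    assumed to have been chosen before everything in the queue.
    """
    n = len(actions)
    # (due_time, chosen_time, action) for every queued action.
    schedule = [
        (now - n + i + z, now - n + i, a)
        for i, (a, z) in enumerate(zip(actions, delays))
    ]
    for tau in range(now, now + horizon):
        due = [(c, a) for d, c, a in schedule if d <= tau]
        # Assumption: the most recently chosen due action is the one executed.
        action = max(due, key=lambda p: p[0])[1] if due else current_action
        state = dynamics(state, action)
    return state
```

For example, with the toy dynamics `lambda s, a: s + a`, calling `predict_state(0, now=5, horizon=3, actions=deque([1, 10]), delays=deque([2, 3]), current_action=0, dynamics=lambda s, a: s + a)` applies action 1 at times 5 and 6 and action 10 at time 7. The agent would then choose its next action for the predicted state rather than for the state it currently observes.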
The authors thoroughly evaluate DEZ on the Atari suite, considering both constant and stochastic delay settings. Their results show that DEZ significantly outperforms the baseline methods, including the previous state-of-the-art 'Delayed-Q' algorithm, in both delay scenarios.
Stats
No explicit numerical data or statistics are quoted here in support of the key claims. The main contributions are the theoretical analysis and the novel algorithm, DEZ.
Quotes
The paper does not contain any striking quotes that support the key arguments.