The paper introduces the framework of Stochastic Execution Delay Markov Decision Processes (SED-MDPs) to model environments where actions are executed with random delays. It establishes a key theoretical result: when the delay realizations are observed, optimizing within the class of Markov policies suffices to attain optimal performance, so the larger class of history-dependent policies offers no advantage.
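Stated compactly, the result says that the best return attainable by history-dependent policies is already attained by a Markov policy once delays are observed. A hedged formalization (the notation here is illustrative, not the paper's own):

```latex
% Illustrative notation, not the paper's own:
%   \Pi^{\mathrm{HD}} = history-dependent policies,
%   \Pi^{\mathrm{MD}} = Markov policies,
%   J(\pi)            = expected return in the SED-MDP with observed delays.
\[
  \sup_{\pi \in \Pi^{\mathrm{HD}}} J(\pi)
  \;=\;
  \sup_{\pi \in \Pi^{\mathrm{MD}}} J(\pi)
\]
```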
Based on this insight, the authors devise Delayed EfficientZero (DEZ), a model-based algorithm that builds upon the EfficientZero framework. DEZ maintains queues of past actions and their delays, and uses the learned dynamics model to predict the future state at which each newly chosen action will actually take effect, making decisions from that predicted state.
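To illustrate the mechanism, here is a minimal Python sketch of the pending-action bookkeeping: a queue of not-yet-executed actions with their observed delays, plus a helper that rolls a learned dynamics model through the queue to estimate the state at which the next action will take effect. The names (`DelayedActionQueue`, `model.next_state`) are assumptions for illustration, not DEZ's actual API, and the sketch assumes pending actions execute in the order they were chosen.

```python
from collections import deque


class DelayedActionQueue:
    """Illustrative sketch of pending-action bookkeeping (not DEZ's real API)."""

    def __init__(self):
        self.pending = deque()  # entries are [action, remaining_delay]

    def push(self, action, delay):
        # Record a newly chosen action together with its observed delay.
        self.pending.append([action, delay])

    def step(self):
        # Advance one environment step: count down every pending delay
        # and return the actions whose delay has now elapsed.
        for entry in self.pending:
            entry[1] -= 1
        executed = []
        while self.pending and self.pending[0][1] <= 0:
            executed.append(self.pending.popleft()[0])
        return executed


def predict_execution_state(model, state, queue):
    # Roll the learned dynamics model through all still-pending actions to
    # estimate the state at which the next chosen action will take effect.
    # `model.next_state` is a hypothetical one-step dynamics call.
    for action, _ in queue.pending:
        state = model.next_state(state, action)
    return state
```

The design point is that planning then proceeds from the predicted execution state rather than the currently observed one, which is what lets the tree search account for actions still in flight.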
The authors thoroughly evaluate DEZ on the Atari suite under both constant and stochastic delays. Their results show that DEZ significantly outperforms the baseline methods, including the previous state-of-the-art 'Delayed-Q' algorithm, in both settings.