Key Concepts
Efficiently optimize policies using diffusion behavior modeling in offline RL.
Summary
Recent advances in offline reinforcement learning have highlighted the potential of diffusion modeling for representing heterogeneous behavior policies. However, the slow sampling speed of diffusion policies poses a practical challenge. Score Regularized Policy Optimization (SRPO) offers an efficient way to extract a deterministic inference policy from pretrained critic and behavior models, entirely avoiding the computationally intensive iterative diffusion sampling scheme. By regularizing the policy gradient with the score function of the behavior distribution, obtained from a pretrained diffusion behavior model, SRPO retains the generative power of diffusion modeling while greatly improving computational efficiency. Experimental results show a substantial boost in action sampling speed and a sharp reduction in computational cost compared with leading diffusion-based methods.
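To make the mechanism concrete, here is a minimal, hypothetical PyTorch sketch of a score-regularized actor update in the spirit described above; it is not the authors' released implementation. The names `policy`, `q_net`, and `behavior_eps`, the coefficient `beta`, and the cosine noise schedule are all illustrative assumptions. The key point it shows: the deterministic policy ascends the critic, while the pretrained diffusion model's noise prediction (proportional to the negative behavior score) pulls the action toward the behavior distribution, so no diffusion sampling loop is needed at inference time.

```python
import torch

def srpo_style_update(policy, q_net, behavior_eps, optimizer, states,
                      beta=0.05, n_timesteps=1000):
    """One score-regularized actor step (illustrative sketch, not official SRPO code).

    policy:        deterministic actor, states -> actions
    q_net:         pretrained critic Q(s, a)              (assumed signature)
    behavior_eps:  pretrained diffusion noise predictor eps(a_t, s, t), whose
                   output is proportional to the negative behavior score
    """
    actions = policy(states)

    # Critic term: ascend Q(s, pi(s)).
    q_loss = -q_net(states, actions).mean()

    # Score regularization: perturb the policy action as in diffusion training,
    # then query the behavior model's noise prediction at the perturbed point.
    t = torch.randint(0, n_timesteps, (states.shape[0],), device=states.device)
    alpha_bar = torch.cos(0.5 * torch.pi * t.float() / n_timesteps) ** 2  # assumed schedule
    noise = torch.randn_like(actions)
    a_t = (alpha_bar.sqrt().unsqueeze(-1) * actions.detach()
           + (1 - alpha_bar).sqrt().unsqueeze(-1) * noise)

    eps_hat = behavior_eps(a_t, states, t).detach()
    # Surrogate whose gradient w.r.t. the action equals eps_hat; minimizing it
    # moves the action along the (estimated) behavior score direction.
    reg_loss = (eps_hat * actions).sum(-1).mean()

    loss = q_loss + beta * reg_loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Because the actor is deterministic, evaluation requires only a single forward pass through `policy`, which is what yields the inference speedups reported below.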
Statistics
Extensive results on D4RL tasks show that our method boosts action sampling speed by more than 25 times compared with various leading diffusion-based methods in locomotion tasks.
SRPO enjoys a more than 25× boost in action sampling speed and less than 1% of computational cost for evaluation compared with several leading diffusion-based methods while maintaining similar overall performance in locomotion tasks.
SRPO samples actions 25 to 1000 times faster than other diffusion-based methods.
Quotes
"SRPO entirely circumvents the computationally demanding action sampling scheme associated with the diffusion process."
"Our method enjoys powerful generative capabilities of diffusion modeling while completely circumventing the computationally intensive and time-consuming diffusion sampling scheme."
"SRPO maintains high computational efficiency, especially fast inference speed, while enabling the use of a powerful diffusion model."