Recent advances in offline reinforcement learning have highlighted the potential of diffusion models for representing heterogeneous behavior policies. However, the slow sampling speed of diffusion policies poses a challenge at inference time. Score Regularized Policy Optimization (SRPO) is an efficient method for extracting a deterministic inference policy from the critic and behavior models, avoiding the computationally intensive diffusion sampling scheme. By leveraging a pretrained diffusion behavior model, SRPO regularizes the policy gradient with the score function of the behavior distribution, retaining the expressiveness of diffusion modeling while greatly improving computational efficiency. Experimental results show a substantial boost in action sampling speed and reduced computational cost compared to leading diffusion-based methods.
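To make the idea concrete, here is a minimal sketch of an SRPO-style policy update, assuming three pretrained components. The names `policy`, `critic`, and `behavior_score` are illustrative placeholders, not the paper's actual API: the deterministic policy is trained by gradient ascent on the critic's Q-value, regularized by the behavior score function so that no diffusion sampling is needed at inference time.

```python
import torch

def srpo_policy_loss(policy, critic, behavior_score, states, beta=0.05):
    """One SRPO-style policy update step (sketch, not the official implementation).

    policy(states)            -> deterministic actions pi_theta(s)
    critic(states, actions)   -> Q-value estimates Q(s, a)
    behavior_score(states, a) -> grad_a log mu(a|s), estimated by the
                                 pretrained diffusion behavior model
    beta                      -> regularization weight (hypothetical value)
    """
    actions = policy(states)                  # deterministic inference policy
    q_term = critic(states, actions).mean()   # exploit the learned critic

    # Score regularization: nudge actions toward high-density regions of the
    # behavior distribution. detach() keeps gradients from flowing into the
    # score network, so the term contributes grad_theta pi(s) * score.
    score = behavior_score(states, actions).detach()
    reg_term = (actions * score).sum(dim=-1).mean()

    # Gradient ascent on q_term + beta * reg_term == descent on the negative.
    return -(q_term + beta * reg_term)
```

Because `policy` outputs actions directly rather than denoising them step by step, a single forward pass replaces the multi-step diffusion sampling chain, which is where the reported speedup comes from.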
Key Insights Extracted From
by Huayu Chen, C... at arxiv.org, 03-18-2024
https://arxiv.org/pdf/2310.07297.pdf