The paper introduces the Constraint-Conditioned Policy Optimization (CCPO) framework for versatile safe reinforcement learning. The key challenges it addresses are training efficiency and zero-shot adaptation to constraint thresholds not seen during training.
The method consists of two integrated components:
Versatile Value Estimation (VVE): This module uses value function representation learning to estimate the value functions of the versatile policy under unseen threshold conditions, allowing a single critic to generalize across diverse constraint thresholds (a minimal conditioned-critic sketch follows the component descriptions).
Conditioned Variational Inference (CVI): This module encodes arbitrary threshold conditions into the policy during training, allowing the policy to adapt zero-shot to unseen thresholds without requiring behavior agents to collect data under those specific conditions (a corresponding policy sketch also appears below).
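To make the idea of threshold-conditioned value estimation concrete, here is a minimal, hypothetical sketch of one way a cost critic could take the constraint threshold as an extra input. The class name ConditionedCostCritic, the network sizes, and the one-step TD target are illustrative assumptions, not the architecture or objective used in the paper.

```python
# Hypothetical sketch: a cost critic conditioned on the constraint threshold.
# All names and hyperparameters here are illustrative, not taken from the paper.
import torch
import torch.nn as nn


class ConditionedCostCritic(nn.Module):
    """Estimates the expected cumulative cost V_c(s | kappa) for a state s
    and a scalar constraint threshold kappa supplied as an extra input."""

    def __init__(self, state_dim: int, hidden: int = 64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(state_dim + 1, hidden),  # +1 for the scalar threshold
            nn.Tanh(),
            nn.Linear(hidden, hidden),
            nn.Tanh(),
            nn.Linear(hidden, 1),
        )

    def forward(self, state: torch.Tensor, threshold: torch.Tensor) -> torch.Tensor:
        # Concatenate the threshold so one network covers many constraint levels.
        return self.net(torch.cat([state, threshold], dim=-1)).squeeze(-1)


def td_target(critic, next_state, threshold, cost, done, gamma=0.99):
    # Standard one-step TD target for the cost value, holding the threshold fixed.
    with torch.no_grad():
        return cost + gamma * (1.0 - done) * critic(next_state, threshold)
```

During training, thresholds would be sampled from the range of interest so that the same critic can later be queried at thresholds it was never explicitly trained on.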
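The corresponding policy sketch below shows one simple way a threshold can be encoded into the policy itself. The exponential advantage-style weight is a simplified stand-in for the paper's variational-inference derivation; ConditionedGaussianPolicy, beta, and lam are hypothetical names and hyperparameters, not the paper's exact objective.

```python
# Hypothetical sketch: a Gaussian policy conditioned on the threshold, updated by
# re-weighting logged actions. The weighting rule is a simplified stand-in, not
# the paper's conditioned variational inference objective.
import torch
import torch.nn as nn


class ConditionedGaussianPolicy(nn.Module):
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 64):
        super().__init__()
        self.body = nn.Sequential(
            nn.Linear(state_dim + 1, hidden), nn.Tanh(),
            nn.Linear(hidden, hidden), nn.Tanh(),
        )
        self.mean = nn.Linear(hidden, action_dim)
        self.log_std = nn.Parameter(torch.zeros(action_dim))

    def dist(self, state, threshold):
        # The threshold is just another policy input, so any value can be queried.
        h = self.body(torch.cat([state, threshold], dim=-1))
        return torch.distributions.Normal(self.mean(h), self.log_std.exp())


def conditioned_policy_loss(policy, state, action, threshold,
                            reward_adv, cost_adv, beta=1.0, lam=1.0):
    """Weight the log-likelihood of logged actions by reward advantage minus a
    penalty on cost advantage, while conditioning the policy on the threshold."""
    logp = policy.dist(state, threshold).log_prob(action).sum(-1)
    weight = torch.exp((reward_adv - lam * cost_adv) / beta).clamp(max=20.0)
    return -(weight.detach() * logp).mean()
```

Because the threshold enters only as a policy input, querying a new threshold at deployment requires no additional data collection, which mirrors the zero-shot adaptation goal described above.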
The theoretical analysis provides insights into the data efficiency and safety guarantees of the proposed approach. Comprehensive evaluations show that CCPO outperforms baseline methods in both safety and task performance across varying constraint conditions, especially in tasks with high-dimensional state and action spaces, where the baselines fail to achieve safe adaptation.
Source: Yihang Yao et al., arxiv.org, 05-01-2024, https://arxiv.org/pdf/2310.03718.pdf