In offline multi-agent reinforcement learning, the characteristics of the dataset strongly influence algorithm performance, so a systematic approach to the data is needed.
A novel offline reinforcement learning problem setting, Positive-Unlabeled Offline RL (PUORL), is introduced to effectively utilize domain-unlabeled data in scenarios with two distinct domains. An algorithmic framework is proposed that leverages positive-unlabeled learning to predict domain labels and integrate the domain-unlabeled data into policy training.
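As a rough illustration of the domain-label prediction step, a classifier could be trained from positive (domain-labeled) and domain-unlabeled transitions with a non-negative PU risk estimator (in the spirit of Kiryo et al.); the network architecture, class prior `pi_p`, and dummy batches below are assumptions for the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: non-negative PU risk for predicting domain labels from
# positive (domain-labeled) and unlabeled transitions. The classifier, the class
# prior `pi_p`, and the tensors below are illustrative assumptions.

def nnpu_loss(scores_pos, scores_unl, pi_p, loss_fn=lambda z: torch.sigmoid(-z)):
    """Non-negative PU risk: R = pi_p * R_p^+ + max(0, R_u^- - pi_p * R_p^-)."""
    risk_pos = loss_fn(scores_pos).mean()        # positive samples labeled positive
    risk_pos_neg = loss_fn(-scores_pos).mean()   # positive samples labeled negative
    risk_unl_neg = loss_fn(-scores_unl).mean()   # unlabeled samples labeled negative
    neg_risk = risk_unl_neg - pi_p * risk_pos_neg
    return pi_p * risk_pos + torch.clamp(neg_risk, min=0.0)

classifier = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(classifier.parameters(), lr=3e-4)

# Dummy batches standing in for domain-labeled (positive) and domain-unlabeled states.
x_pos, x_unl = torch.randn(128, 17), torch.randn(512, 17)
loss = nnpu_loss(classifier(x_pos).squeeze(-1), classifier(x_unl).squeeze(-1), pi_p=0.4)
loss.backward()
opt.step()
```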
Policy-guided diffusion generates synthetic trajectories that balance action likelihoods under both the target and behavior policies, leading to plausible trajectories with high target policy probability while retaining low dynamics error.
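A hedged sketch of one way such policy guidance could enter a reverse-diffusion step: the denoiser's prediction is shifted by the gradient of the target policy's log-likelihood, scaled by a small coefficient so sampled trajectories stay close to the behavior distribution. The names `denoiser`, `target_policy_logprob`, and `guidance_scale` are illustrative assumptions, not the paper's API.

```python
import torch

# Minimal sketch (not the paper's implementation): one guided reverse-diffusion step
# that nudges the denoised trajectory toward actions likely under the target policy.

def guided_denoise_step(x_t, t, denoiser, target_policy_logprob, guidance_scale=0.1):
    # Base noise prediction from the behavior-policy diffusion model.
    eps_hat = denoiser(x_t, t)

    # Gradient of the target-policy log-likelihood w.r.t. the noisy trajectory,
    # used as a guidance signal for the sampled actions.
    x_req = x_t.detach().requires_grad_(True)
    logp = target_policy_logprob(x_req).sum()
    grad = torch.autograd.grad(logp, x_req)[0]

    # Shift the noise estimate by the guidance term; a small scale keeps trajectories
    # near the behavior distribution and thus keeps dynamics error low.
    return eps_hat - guidance_scale * grad
```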
The Diverse Randomized Value Functions (DRVF) method estimates the distribution of Q-values using randomly initialized Q-ensembles and diversity regularization. This yields robust uncertainty quantification and enables a provably pessimistic update of the value function.
This paper proposes a strategy that employs diverse randomized value functions to estimate the posterior distribution of Q-values, providing robust uncertainty quantification and lower confidence bounds (LCB) of the Q-values. By applying moderate value penalties to out-of-distribution (OOD) actions, the method yields a provably pessimistic value update.
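For intuition only, an ensemble-based LCB and a simple diversity term might look as follows; the ensemble size, penalty coefficient `beta`, and the variance-based regularizer are assumptions rather than the paper's exact formulation.

```python
import torch

# Illustrative sketch of a lower-confidence-bound (LCB) estimate from a randomized
# Q-ensemble, plus a simple stand-in for a diversity regularizer.

def lcb_q(q_ensemble_values, beta=1.0):
    """q_ensemble_values: tensor of shape [num_ensemble, batch]."""
    mean_q = q_ensemble_values.mean(dim=0)
    std_q = q_ensemble_values.std(dim=0)
    return mean_q - beta * std_q  # pessimistic (LCB) value estimate

def diversity_regularizer(q_ensemble_values):
    """Encourage ensemble members to disagree by penalizing low variance across
    the ensemble (added as a loss term); a simplified stand-in for the paper's
    diversity regularization."""
    return -q_ensemble_values.var(dim=0).mean()

# Example with random Q-values for a batch of 256 state-action pairs and 10 members.
q_vals = torch.randn(10, 256)
target = lcb_q(q_vals, beta=2.0)
reg = diversity_regularizer(q_vals)
```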
Compositional Conservatism with Anchor-seeking (COCOA) is a framework that pursues conservatism in the compositional input space of the policy and Q-function, independently of, and agnostically to, the behavioral conservatism prevalent in offline reinforcement learning.
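A minimal sketch of the compositional idea, assuming a nearest-neighbor stand-in for the learned anchor-seeking component: each state is split into an in-dataset anchor plus a residual delta, and the policy/Q-function consumes the pair instead of the raw state.

```python
import numpy as np

# Hedged sketch: decompose each input state into an in-dataset "anchor" and a
# residual delta. The nearest-neighbor search below is an illustrative stand-in
# for the learned anchor-seeking policy, not the paper's implementation.

def decompose_state(state, dataset_states):
    """Return (anchor, delta) where anchor is the closest dataset state."""
    dists = np.linalg.norm(dataset_states - state, axis=1)
    anchor = dataset_states[np.argmin(dists)]
    delta = state - anchor
    return anchor, delta

dataset_states = np.random.randn(1000, 11)   # dummy offline-dataset states
state = np.random.randn(11)
anchor, delta = decompose_state(state, dataset_states)

# The policy/Q-function would take (anchor, delta) as input, so conservatism can be
# imposed in this compositional input space rather than on actions directly.
policy_input = np.concatenate([anchor, delta])
```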
The authors propose a novel Grid-Mapping Pseudo-Count (GPC) method to accurately quantify uncertainty in continuous offline reinforcement learning, and develop the GPC-SAC algorithm by combining GPC with the Soft Actor-Critic framework to achieve better performance and lower computational cost compared to existing algorithms.
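A toy sketch of a grid-based pseudo-count, assuming a fixed discretization and a 1/sqrt(count) uncertainty penalty; the grid resolution, bounds, and penalty form are illustrative assumptions, not the exact GPC definition.

```python
import numpy as np
from collections import defaultdict

# Toy sketch: discretize each (state, action) pair into a grid cell and use the
# visit count of that cell as a pseudo-count for an uncertainty penalty.

class GridPseudoCount:
    def __init__(self, low, high, bins=20):
        self.low, self.high, self.bins = np.asarray(low), np.asarray(high), bins
        self.counts = defaultdict(int)

    def _cell(self, sa):
        ratio = (np.asarray(sa) - self.low) / (self.high - self.low + 1e-8)
        idx = np.clip((ratio * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def update(self, sa):
        self.counts[self._cell(sa)] += 1

    def uncertainty(self, sa):
        # Larger penalty for rarely visited (state, action) cells.
        return 1.0 / np.sqrt(self.counts[self._cell(sa)] + 1.0)

# In a SAC-style critic update, the uncertainty could be subtracted from the target,
# e.g. target = r + gamma * (min_q - alpha * log_pi) - beta * gpc.uncertainty(sa).
gpc = GridPseudoCount(low=[-1.0] * 4, high=[1.0] * 4)
gpc.update(np.zeros(4))
print(gpc.uncertainty(np.zeros(4)))
```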
Conservative Density Estimation (CDE) improves performance in offline RL by addressing extrapolation errors and data scarcity.