In offline multi-agent reinforcement learning, the characteristics of the dataset strongly influence algorithm performance, so a systematic approach to the data is needed.
A novel offline reinforcement learning problem setting, Positive-Unlabeled Offline RL (PUORL), is introduced to effectively utilize domain-unlabeled data in scenarios with two distinct domains. An algorithmic framework is proposed that leverages positive-unlabeled learning to predict domain labels and integrate the domain-unlabeled data into policy training.
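As a rough illustration of the domain-label prediction step, a classifier could be trained from positive (domain-labeled) and domain-unlabeled transitions with a non-negative PU risk estimator (in the spirit of Kiryo et al.); the network architecture, class prior `pi_p`, and dummy batches below are assumptions for the sketch, not the paper's implementation.

```python
import torch
import torch.nn as nn

# Hypothetical sketch: non-negative PU risk for predicting domain labels from
# positive (domain-labeled) and unlabeled transitions. The classifier, the class
# prior `pi_p`, and the tensors below are illustrative assumptions.

def nnpu_loss(scores_pos, scores_unl, pi_p, loss_fn=lambda z: torch.sigmoid(-z)):
    """Non-negative PU risk: R = pi_p * R_p^+ + max(0, R_u^- - pi_p * R_p^-)."""
    risk_pos = loss_fn(scores_pos).mean()        # positive samples labeled positive
    risk_pos_neg = loss_fn(-scores_pos).mean()   # positive samples labeled negative
    risk_unl_neg = loss_fn(-scores_unl).mean()   # unlabeled samples labeled negative
    neg_risk = risk_unl_neg - pi_p * risk_pos_neg
    return pi_p * risk_pos + torch.clamp(neg_risk, min=0.0)

classifier = nn.Sequential(nn.Linear(17, 64), nn.ReLU(), nn.Linear(64, 1))
opt = torch.optim.Adam(classifier.parameters(), lr=3e-4)

# Dummy batches standing in for domain-labeled (positive) and domain-unlabeled states.
x_pos, x_unl = torch.randn(128, 17), torch.randn(512, 17)
loss = nnpu_loss(classifier(x_pos).squeeze(-1), classifier(x_unl).squeeze(-1), pi_p=0.4)
loss.backward()
opt.step()
```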
Policy-guided diffusion generates synthetic trajectories that balance action likelihoods under both the target and behavior policies, leading to plausible trajectories with high target policy probability while retaining low dynamics error.
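A hedged sketch of one way such policy guidance could enter a reverse-diffusion step: the denoiser's prediction is shifted by the gradient of the target policy's log-likelihood, scaled by a small coefficient so sampled trajectories stay close to the behavior distribution. The names `denoiser`, `target_policy_logprob`, and `guidance_scale` are illustrative assumptions, not the paper's API.

```python
import torch

# Minimal sketch (not the paper's implementation): one guided reverse-diffusion step
# that nudges the denoised trajectory toward actions likely under the target policy.

def guided_denoise_step(x_t, t, denoiser, target_policy_logprob, guidance_scale=0.1):
    # Base noise prediction from the behavior-policy diffusion model.
    eps_hat = denoiser(x_t, t)

    # Gradient of the target-policy log-likelihood w.r.t. the noisy trajectory,
    # used as a guidance signal for the sampled actions.
    x_req = x_t.detach().requires_grad_(True)
    logp = target_policy_logprob(x_req).sum()
    grad = torch.autograd.grad(logp, x_req)[0]

    # Shift the noise estimate by the guidance term; a small scale keeps trajectories
    # near the behavior distribution and thus keeps dynamics error low.
    return eps_hat - guidance_scale * grad
```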
The Diverse Randomized Value Functions (DRVF) method estimates the distribution of Q-values using randomly initialized Q-ensembles and diversity regularization. This yields robust uncertainty quantification and enables a provably pessimistic update of the value function.
This paper proposes a strategy that employs diverse randomized value functions to estimate the posterior distribution of Q-values, providing robust uncertainty quantification and lower confidence bounds (LCB) of the Q-values. By applying moderate value penalties to out-of-distribution (OOD) actions, the method yields a provably pessimistic value update.
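For intuition only, an ensemble-based LCB and a simple diversity term might look as follows; the ensemble size, penalty coefficient `beta`, and the variance-based regularizer are assumptions rather than the paper's exact formulation.

```python
import torch

# Illustrative sketch of a lower-confidence-bound (LCB) estimate from a randomized
# Q-ensemble, plus a simple stand-in for a diversity regularizer.

def lcb_q(q_ensemble_values, beta=1.0):
    """q_ensemble_values: tensor of shape [num_ensemble, batch]."""
    mean_q = q_ensemble_values.mean(dim=0)
    std_q = q_ensemble_values.std(dim=0)
    return mean_q - beta * std_q  # pessimistic (LCB) value estimate

def diversity_regularizer(q_ensemble_values):
    """Encourage ensemble members to disagree by penalizing low variance across
    the ensemble (added as a loss term); a simplified stand-in for the paper's
    diversity regularization."""
    return -q_ensemble_values.var(dim=0).mean()

# Example with random Q-values for a batch of 256 state-action pairs and 10 members.
q_vals = torch.randn(10, 256)
target = lcb_q(q_vals, beta=2.0)
reg = diversity_regularizer(q_vals)
```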
Compositional Conservatism with Anchor-seeking (COCOA) is a framework that pursues conservatism in the compositional input space of the policy and Q-function, independently of, and agnostically to, the behavioral conservatism prevalent in offline reinforcement learning.
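A minimal sketch of the compositional idea, assuming a nearest-neighbor stand-in for the learned anchor-seeking component: each state is split into an in-dataset anchor plus a residual delta, and the policy/Q-function consumes the pair instead of the raw state.

```python
import numpy as np

# Hedged sketch: decompose each input state into an in-dataset "anchor" and a
# residual delta. The nearest-neighbor search below is an illustrative stand-in
# for the learned anchor-seeking policy, not the paper's implementation.

def decompose_state(state, dataset_states):
    """Return (anchor, delta) where anchor is the closest dataset state."""
    dists = np.linalg.norm(dataset_states - state, axis=1)
    anchor = dataset_states[np.argmin(dists)]
    delta = state - anchor
    return anchor, delta

dataset_states = np.random.randn(1000, 11)   # dummy offline-dataset states
state = np.random.randn(11)
anchor, delta = decompose_state(state, dataset_states)

# The policy/Q-function would take (anchor, delta) as input, so conservatism can be
# imposed in this compositional input space rather than on actions directly.
policy_input = np.concatenate([anchor, delta])
```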
The authors propose a novel Grid-Mapping Pseudo-Count (GPC) method to accurately quantify uncertainty in continuous offline reinforcement learning, and develop the GPC-SAC algorithm by combining GPC with the Soft Actor-Critic framework to achieve better performance and lower computational cost compared to existing algorithms.
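A toy sketch of a grid-based pseudo-count, assuming a fixed discretization and a 1/sqrt(count) uncertainty penalty; the grid resolution, bounds, and penalty form are illustrative assumptions, not the exact GPC definition.

```python
import numpy as np
from collections import defaultdict

# Toy sketch: discretize each (state, action) pair into a grid cell and use the
# visit count of that cell as a pseudo-count for an uncertainty penalty.

class GridPseudoCount:
    def __init__(self, low, high, bins=20):
        self.low, self.high, self.bins = np.asarray(low), np.asarray(high), bins
        self.counts = defaultdict(int)

    def _cell(self, sa):
        ratio = (np.asarray(sa) - self.low) / (self.high - self.low + 1e-8)
        idx = np.clip((ratio * self.bins).astype(int), 0, self.bins - 1)
        return tuple(idx)

    def update(self, sa):
        self.counts[self._cell(sa)] += 1

    def uncertainty(self, sa):
        # Larger penalty for rarely visited (state, action) cells.
        return 1.0 / np.sqrt(self.counts[self._cell(sa)] + 1.0)

# In a SAC-style critic update, the uncertainty could be subtracted from the target,
# e.g. target = r + gamma * (min_q - alpha * log_pi) - beta * gpc.uncertainty(sa).
gpc = GridPseudoCount(low=[-1.0] * 4, high=[1.0] * 4)
gpc.update(np.zeros(4))
print(gpc.uncertainty(np.zeros(4)))
```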
Conservative Density Estimation (CDE) improves performance in offline RL by addressing extrapolation errors and data scarcity.