This paper introduces Offline RL-VLM-F, a novel system that leverages vision-language models (VLMs) to automatically generate reward labels for unlabeled datasets, enabling offline reinforcement learning for complex real-world robotics tasks such as robot-assisted dressing; the method also outperforms existing baselines on a range of simulated manipulation tasks.
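A minimal sketch of the reward-labeling stage, assuming an offline dataset of image observations and a hypothetical `query_vlm_preference` helper; the paper's actual prompting strategy and reward-model architecture may differ. The relabeled transitions would then be handed to a standard offline RL algorithm.

```python
# Minimal sketch of VLM-based reward labeling for an offline dataset.
# `query_vlm_preference` is a hypothetical helper (not part of the paper's
# released code) that asks a VLM which of two frames shows more task progress.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    def __init__(self, obs_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, 1),
        )

    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.net(obs).squeeze(-1)

def label_dataset_with_vlm(dataset, query_vlm_preference, task_description,
                           obs_dim: int, num_queries: int = 1000,
                           epochs: int = 10):
    """Query a VLM for preferences over observation pairs, fit a reward model
    with a Bradley-Terry-style preference loss, then relabel the dataset."""
    reward_model = RewardModel(obs_dim)
    optim = torch.optim.Adam(reward_model.parameters(), lr=3e-4)

    # 1) Collect preference labels on randomly chosen observation pairs.
    pairs, labels = [], []
    for _ in range(num_queries):
        i, j = torch.randint(len(dataset["obs"]), (2,)).tolist()
        label = query_vlm_preference(dataset["image"][i], dataset["image"][j],
                                     task_description)  # returns 0 or 1
        pairs.append((dataset["obs"][i], dataset["obs"][j]))
        labels.append(label)

    # 2) Fit the reward model so preferred observations score higher.
    for _ in range(epochs):
        for (o1, o2), y in zip(pairs, labels):
            logits = torch.stack([reward_model(o1), reward_model(o2)])
            loss = nn.functional.cross_entropy(logits.unsqueeze(0),
                                               torch.tensor([y]))
            optim.zero_grad()
            loss.backward()
            optim.step()

    # 3) Use the learned reward model to label every transition.
    with torch.no_grad():
        dataset["reward"] = reward_model(dataset["obs"])
    return dataset
```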
C-LAP, a novel model-based offline reinforcement learning method, leverages a generative model of the joint state-action distribution and a constrained policy optimization approach to enhance performance and mitigate value overestimation, particularly excelling in scenarios with visual observations.
The hypercube policy regularization framework improves offline reinforcement learning by allowing agents to explore actions corresponding to similar states within a hypercube, striking a balance between conservatism and aggressiveness for better policy learning.
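A rough sketch of the core idea as stated above, assuming states are bucketed into axis-aligned hypercube cells of side `delta` and the regularizer is the distance from the policy's action to the nearest dataset action in the same cell; both the cell size and the exact penalty form are illustrative choices, not the paper's formulation.

```python
# Rough sketch of hypercube-style policy regularization: instead of pulling
# the policy toward the single dataset action for a state, allow it to match
# any action observed within the same state hypercube cell.
import numpy as np
from collections import defaultdict

def build_hypercube_index(states: np.ndarray, actions: np.ndarray, delta: float):
    """Bucket dataset actions by the hypercube cell of their state."""
    index = defaultdict(list)
    for s, a in zip(states, actions):
        cell = tuple(np.floor(s / delta).astype(int))
        index[cell].append(a)
    return {cell: np.stack(acts) for cell, acts in index.items()}

def hypercube_regularizer(policy_action: np.ndarray, state: np.ndarray,
                          index, delta: float) -> float:
    """Squared distance to the nearest in-cell dataset action.
    Falls back to zero penalty if the cell is empty (illustrative choice)."""
    cell = tuple(np.floor(state / delta).astype(int))
    candidates = index.get(cell)
    if candidates is None:
        return 0.0
    dists = np.sum((candidates - policy_action) ** 2, axis=1)
    return float(dists.min())
```

In an actor-critic setup this penalty would sit alongside the usual Q-maximization term; a larger `delta` admits more dataset actions per cell and is therefore less conservative.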
This paper proposes a novel framework for domain adaptation in offline reinforcement learning (RL) with limited target samples, theoretically analyzing the trade-off between leveraging a large, related source dataset and relying on a limited target dataset, and providing empirical validation on the Procgen benchmark.
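A toy illustration of that trade-off, assuming both domains are simply mixed in the training loss: the source data is down-weighted by a scalar `lam`, whereas the paper's contribution is a theoretically derived choice of this weighting.

```python
# Toy sketch of the source/target trade-off: train on a mixture of the large
# source dataset and the small target dataset, down-weighting source samples
# by a coefficient `lam`. lam -> 0 ignores the source domain entirely;
# lam -> 1 treats the two domains as identical.
import torch

def mixed_domain_loss(loss_fn, model, target_batch, source_batch, lam: float):
    """Weighted sum of the target-domain loss and the source-domain loss."""
    target_loss = loss_fn(model, target_batch)
    source_loss = loss_fn(model, source_batch)
    return target_loss + lam * source_loss
```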
Diffusion Trusted Q-Learning (DTQL) leverages the expressiveness of diffusion policies for behavior cloning while employing a novel diffusion trust region loss to guide a computationally efficient one-step policy for superior performance in offline reinforcement learning tasks.
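A schematic of the two-policy setup described above, assuming a frozen diffusion model fitted to the behavior data: the one-step policy maximizes Q while a trust-region penalty, computed from the diffusion model's denoising error at the policy's action, keeps it in-distribution. The penalty form and the `add_noise`/forward interfaces below are illustrative assumptions, not the paper's exact loss or API.

```python
# Schematic of a DTQL-style actor objective: a one-step policy maximizes Q
# while a trust-region penalty from a (frozen) behavior diffusion model keeps
# its actions in high-density regions of the dataset.
import torch

def diffusion_denoise_loss(diffusion_model, state, action):
    """Denoising error of the behavior diffusion model at (state, action);
    low error ~ the action lies in a high-density region of the data."""
    t = torch.randint(0, diffusion_model.num_steps, (action.shape[0],),
                      device=action.device)
    noise = torch.randn_like(action)
    noisy_action = diffusion_model.add_noise(action, noise, t)   # assumed API
    pred_noise = diffusion_model(noisy_action, state, t)          # assumed API
    return ((pred_noise - noise) ** 2).mean()

def one_step_actor_loss(one_step_policy, q_net, diffusion_model,
                        state, alpha: float = 1.0):
    action = one_step_policy(state)                 # single forward pass
    q_term = -q_net(state, action).mean()           # maximize Q
    trust_term = diffusion_denoise_loss(diffusion_model, state, action)
    return q_term + alpha * trust_term
```

At inference only the cheap one-step policy is called; the diffusion model is used purely as a training-time trust region.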
Diffusion-DICE is a novel offline reinforcement learning algorithm that leverages diffusion models to transform the behavior policy distribution into an optimal policy distribution, achieving state-of-the-art performance by minimizing error exploitation in value function approximation.
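One plausible reading of how such a method limits error exploitation is an in-support "select" step: candidate actions come only from the learned diffusion model, and the value function merely ranks them. The sketch below assumes a `diffusion_policy.sample` interface and is not the paper's exact procedure.

```python
# Sketch of an in-support select step: draw several candidate actions from a
# diffusion model fitted to (a transformation of) the behavior distribution,
# then keep the candidate the learned Q-function ranks highest. Restricting
# the argmax to model samples is what limits exploitation of Q-function errors.
import torch

@torch.no_grad()
def select_action(diffusion_policy, q_net, state, num_candidates: int = 32):
    states = state.unsqueeze(0).repeat(num_candidates, 1)  # (N, state_dim)
    candidates = diffusion_policy.sample(states)            # (N, act_dim), assumed API
    q_values = q_net(states, candidates).squeeze(-1)        # (N,)
    return candidates[q_values.argmax()]
```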
LPT, a novel generative model, effectively performs planning in offline reinforcement learning by leveraging a latent variable to connect trajectory generation with final returns, achieving temporal consistency and outperforming existing methods in challenging tasks.
This paper proposes a new method called Offline Behavior Distillation (OBD), which distills compact expert behavior data from large amounts of suboptimal reinforcement learning data, enabling fast and efficient policy learning.
Mamba, a linear-time sequence model, can be effectively adapted for trajectory optimization in offline reinforcement learning, achieving comparable or superior performance to transformer-based methods while using significantly fewer parameters.
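A sketch of the trajectory-optimization framing, assuming the standard Decision-Transformer-style tokenization of (return-to-go, state, action) triples; the sequence backbone is injected, so a linear-time Mamba block (e.g. from the `mamba_ssm` package) can replace a Transformer without changing the rest of the model. Details of the paper's architecture may differ.

```python
# Sketch of a trajectory model with a swappable sequence backbone: interleave
# (return-to-go, state, action) tokens and let any (B, L, D) -> (B, L, D)
# module -- such as a Mamba block -- mix them over time.
import torch
import torch.nn as nn

class TrajectorySequenceModel(nn.Module):
    def __init__(self, state_dim, act_dim, d_model, backbone: nn.Module):
        super().__init__()
        self.embed_rtg = nn.Linear(1, d_model)
        self.embed_state = nn.Linear(state_dim, d_model)
        self.embed_action = nn.Linear(act_dim, d_model)
        self.backbone = backbone        # e.g. mamba_ssm.Mamba(d_model=d_model)
        self.predict_action = nn.Linear(d_model, act_dim)

    def forward(self, rtg, states, actions):
        # rtg: (B, T, 1), states: (B, T, state_dim), actions: (B, T, act_dim)
        B, T = states.shape[:2]
        tokens = torch.stack(
            [self.embed_rtg(rtg), self.embed_state(states),
             self.embed_action(actions)], dim=2)       # (B, T, 3, D)
        tokens = tokens.reshape(B, 3 * T, -1)          # interleave R, s, a
        hidden = self.backbone(tokens)                 # (B, 3T, D)
        # Predict each action from the hidden state of its state token.
        return self.predict_action(hidden[:, 1::3])    # (B, T, act_dim)
```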
Branch Value Estimation (BVE) is a novel offline reinforcement learning method that effectively addresses the challenges of learning in large, discrete combinatorial action spaces by representing the action space as a tree and learning to evaluate only a small subset of actions at each timestep.
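A sketch of tree-structured action selection under the assumptions that a combinatorial action is a vector of discrete sub-actions and that a branch-value network can score partial action prefixes; the padding scheme and network interface are illustrative, not the paper's.

```python
# Sketch of tree-structured selection for a combinatorial action space:
# a branch-value network scores partial action prefixes, and a greedy descent
# fixes one sub-action per level, so only a small subset of actions is ever
# evaluated. The branch_value_net interface is an illustrative assumption.
import torch

def greedy_tree_action(branch_value_net, state, num_sub_actions, choices_per_sub):
    """Pick a combinatorial action by descending the sub-action tree greedily."""
    prefix = []
    for depth in range(num_sub_actions):
        best_choice, best_value = None, -float("inf")
        for choice in range(choices_per_sub):
            candidate = prefix + [choice]
            # Pad the partial action so the network sees a fixed-size input.
            padded = candidate + [-1] * (num_sub_actions - len(candidate))
            value = branch_value_net(state, torch.tensor(padded)).item()
            if value > best_value:
                best_choice, best_value = choice, value
        prefix.append(best_choice)
    return prefix  # one choice per sub-action
```

This evaluates only `num_sub_actions * choices_per_sub` branch values per decision instead of enumerating all `choices_per_sub ** num_sub_actions` combinatorial actions.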