The paper proposes a new framework called Compositional Conservatism with Anchor-seeking (COCOA) for offline reinforcement learning. The key insights are:
Offline RL faces the problem of distributional shift: the states and actions encountered during policy execution may fall outside the distribution of the training dataset. Existing solutions typically address this by incorporating conservatism into the policy or the value function.
COCOA pursues the same objective of conservatism but from a different perspective: it enforces conservatism in the compositional input space of the policy and Q-function rather than in the behavioral space.
COCOA builds upon the transductive reparameterization (bilinear transduction) proposed by Netanyahu et al. (2023), which decomposes the input variable (the state) into an anchor and a delta, the difference between the state and the anchor. COCOA seeks both in-distribution anchors and deltas using a learned reverse dynamics model.
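To make the decomposition concrete, here is a minimal sketch of a bilinearly transduced Q-function: the state is split into an anchor and a delta, each is embedded separately, and the two embeddings are combined with an inner product. The module names, network sizes, and the choice to embed the delta jointly with the action are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class BilinearTransducedQ(nn.Module):
    """Q(s, a) expressed bilinearly in embeddings of (delta, action) and the anchor."""

    def __init__(self, state_dim, action_dim, embed_dim=64):
        super().__init__()
        # Embeds the delta (state minus anchor) together with the action.
        self.delta_net = nn.Sequential(
            nn.Linear(state_dim + action_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )
        # Embeds the anchor state.
        self.anchor_net = nn.Sequential(
            nn.Linear(state_dim, 256), nn.ReLU(),
            nn.Linear(256, embed_dim),
        )

    def forward(self, state, action, anchor):
        delta = state - anchor  # compositional decomposition of the input
        phi = self.delta_net(torch.cat([delta, action], dim=-1))
        psi = self.anchor_net(anchor)
        # Bilinear combination: inner product of the two embeddings.
        return (phi * psi).sum(dim=-1, keepdim=True)
```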
The anchor-seeking policy is trained to find anchors within or near the region of the state space covered by the dataset, encouraging the agent to stay within the known distribution; a sketch of this idea follows.
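The sketch below illustrates one plausible reading of anchor seeking with a learned reverse dynamics model, not the authors' exact procedure: starting from the current state, the anchor-seeking policy proposes backward actions and the reverse model rolls a few steps "backward" so that the resulting anchor lands nearer to states seen in the dataset, while the delta is the remaining difference. All module and argument names (`AnchorSeeker`, `num_backward_steps`, the MLP sizes) are assumptions made for illustration.

```python
import torch
import torch.nn as nn

def mlp(in_dim, out_dim, hidden=256):
    return nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(), nn.Linear(hidden, out_dim))

class AnchorSeeker(nn.Module):
    def __init__(self, state_dim, action_dim, num_backward_steps=3):
        super().__init__()
        self.anchor_policy = mlp(state_dim, action_dim)               # proposes a backward action
        self.reverse_model = mlp(state_dim + action_dim, state_dim)   # predicts the preceding state
        self.num_backward_steps = num_backward_steps

    @torch.no_grad()
    def forward(self, state):
        """Return (anchor, delta) for a batch of states."""
        anchor = state
        for _ in range(self.num_backward_steps):
            back_action = self.anchor_policy(anchor)
            anchor = self.reverse_model(torch.cat([anchor, back_action], dim=-1))
        delta = state - anchor
        return anchor, delta
```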
COCOA is applied to four state-of-the-art offline RL algorithms (CQL, IQL, MOPO, MOBILE) and evaluated on the D4RL benchmark, where it generally improves the performance of each algorithm.
An ablation study shows the importance of the anchor-seeking component, as a variant without it performs worse than the original baseline algorithms.
Key insights distilled from the source content by Yeda Song, Do... (arxiv.org, 04-09-2024): https://arxiv.org/pdf/2404.04682.pdf