The paper introduces the DG-MORL algorithm, which addresses the challenges in multi-objective reinforcement learning (MORL) by leveraging prior demonstrations as guidance. The key highlights are:
Demonstration-Preference Alignment: The algorithm computes corner weights to align the demonstrations with user preferences, addressing the demonstration-preference misalignment issue.
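For intuition, corner weights can be derived from the multi-objective returns of the available demonstrations: with two objectives, they are the preference weights at which the upper envelope of scalarised returns switches from one demonstration to another. The following is a minimal sketch under that two-objective assumption; the helper name, tolerance, and example returns are illustrative, not the paper's exact routine.

```python
import numpy as np

def corner_weights_2d(value_vectors):
    """Corner weights for 2 objectives: weights w where the upper envelope of
    scalarised returns w*v[0] + (1-w)*v[1] switches between value vectors.
    (Illustrative helper, not the paper's exact computation.)"""
    vs = np.asarray(value_vectors, dtype=float)
    candidates = {0.0, 1.0}                      # extreme single-objective weights
    for i in range(len(vs)):
        for j in range(i + 1, len(vs)):
            d = (vs[i, 0] - vs[i, 1]) - (vs[j, 0] - vs[j, 1])
            if abs(d) < 1e-12:
                continue                         # parallel scalarisation lines
            w = (vs[j, 1] - vs[i, 1]) / d        # weight where the two lines intersect
            if 0.0 < w < 1.0:
                candidates.add(round(w, 10))
    corners = []
    for w in sorted(candidates):
        scal = w * vs[:, 0] + (1.0 - w) * vs[:, 1]
        best = scal.max()
        # Keep w if it is an extreme weight or if at least two value vectors
        # are (near-)optimal there, i.e. the envelope actually has a corner.
        if w in (0.0, 1.0) or np.sum(np.isclose(scal, best)) >= 2:
            corners.append(np.array([w, 1.0 - w]))
    return corners

demo_returns = [[10.0, 2.0], [7.0, 6.0], [1.0, 11.0]]  # hypothetical demonstration returns
print(corner_weights_2d(demo_returns))
```

Each returned weight vector identifies a preference at which a different demonstration becomes the best guide, which is how the alignment step matches demonstrations to user preferences.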
Self-Evolving Mechanism: A self-evolving mechanism is introduced to continuously update and improve the guiding demonstrations, mitigating the impact of sub-optimal initial demonstrations and addressing the demonstration deadlock problem.
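A minimal way to picture the self-evolving update: whenever the agent's own rollout achieves a higher scalarised return than the stored demonstration for a given preference weight, the rollout replaces that demonstration, so the guidance can never lock the learner into a sub-optimal behaviour. The sketch below assumes a simple dictionary-based demonstration bank; the names and interfaces are illustrative, not the paper's API.

```python
import numpy as np

def maybe_update_demo(demo_bank, weight, rollout, rollout_return, tol=1e-6):
    """Adopt the agent's rollout as the new guiding demonstration for this
    preference weight if it beats the stored one under linear scalarisation.
    (Sketch of a self-evolving update; demo_bank layout is an assumption.)"""
    w = np.asarray(weight, dtype=float)
    new_score = float(w @ np.asarray(rollout_return, dtype=float))
    old = demo_bank.get(tuple(w))
    if old is None or new_score > old["score"] + tol:
        demo_bank[tuple(w)] = {"trajectory": rollout,
                               "return": rollout_return,
                               "score": new_score}
        return True   # demonstration improved -> helps escape demonstration deadlock
    return False      # existing demonstration is still the better guide
```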
Multi-Stage Curriculum: A multi-stage curriculum is implemented to facilitate a smooth transition from the guide policy to the exploration policy, addressing the policy shift challenge.
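One common way to realise such a curriculum, shown here only as an illustrative sketch rather than the paper's exact scheme, is a shrinking guide-policy roll-in: the guide policy controls the first k steps of each episode, the exploration policy takes over afterwards, and k is reduced stage by stage until the learner acts alone. The gymnasium-style environment and policy interfaces below are assumptions.

```python
def run_curriculum_episode(env, guide_policy, explore_policy, rollin_steps):
    """One episode of a guide-then-explore curriculum: follow the guide policy
    for `rollin_steps` steps, then hand control to the learning policy.
    (Illustrative sketch; env/policy interfaces are assumptions.)"""
    obs, _ = env.reset()
    total_return, done, t = 0.0, False, 0
    while not done:
        policy = guide_policy if t < rollin_steps else explore_policy
        action = policy(obs)
        obs, reward, terminated, truncated, _ = env.step(action)
        total_return += reward   # vector-valued in MORL; kept scalar here for brevity
        done = terminated or truncated
        t += 1
    return total_return

# Hypothetical stage schedule: long roll-ins first, pure exploration last,
# e.g. rollin_steps in [H, H // 2, H // 4, 0] with H the demonstration length.
```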
Empirical Evaluation: Extensive experiments on benchmark MORL environments demonstrate the superiority of DG-MORL over state-of-the-art MORL algorithms in terms of learning efficiency, final performance, and robustness.
Theoretical Analysis: The paper derives lower and upper bounds on the sample efficiency of the DG-MORL algorithm.
Overall, the DG-MORL algorithm effectively leverages prior demonstrations to enhance exploration and policy learning in complex MORL tasks, outperforming existing methods.