Singh, U., Chakraborty, S., Suttle, W. A., Sadler, B. M., Sahu, A. K., Shah, M., ... & Bedi, A. S. (2024). Hierarchical Preference Optimization: Learning to Achieve Goals via Feasible Sub-Goals Prediction. arXiv preprint arXiv:2411.00361.
This paper introduces Hierarchical Preference Optimization (HPO), a novel approach to mitigate the challenges of non-stationarity and infeasible subgoal generation in hierarchical reinforcement learning (HRL) for complex robotic control tasks.
HPO leverages a bi-level optimization framework to address the nested structure of HRL, ensuring the higher-level policy generates feasible subgoals for the lower-level policy. It utilizes a token-level Direct Preference Optimization (DPO) objective, eliminating the need for pre-trained reference policies. The lower-level policy is trained using traditional reinforcement learning, while the higher-level policy learns from preferences, mitigating non-stationarity. The authors employ a primitive-in-the-loop approach to autonomously generate preferences using sparse environment rewards, reducing the reliance on human feedback. The effectiveness of HPO is evaluated on four challenging robotic tasks: maze navigation, pick and place, push, and a Franka kitchen environment.
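As a rough illustration of the reference-free preference objective described above, the sketch below shows a DPO-style pairwise loss over subgoal log-probabilities under the higher-level policy. This is a minimal sketch under stated assumptions: the function name, the beta scaling factor, and the toy tensors are illustrative, not the paper's exact token-level objective.

import torch
import torch.nn.functional as F

def reference_free_preference_loss(logp_preferred, logp_dispreferred, beta=1.0):
    """Pairwise preference loss over higher-level subgoal log-probabilities.

    logp_preferred / logp_dispreferred: log pi_theta(g | s) for the subgoal
    that earned the higher sparse return (preferred) versus the lower one.
    With no pre-trained reference policy, the objective reduces to a logistic
    loss on the scaled log-probability margin (an assumption for this sketch).
    """
    margin = beta * (logp_preferred - logp_dispreferred)
    return -F.logsigmoid(margin).mean()

# Toy usage: log-probabilities of two candidate subgoals for a batch of states.
logp_pos = torch.tensor([-1.2, -0.8], requires_grad=True)
logp_neg = torch.tensor([-2.5, -1.9], requires_grad=True)
loss = reference_free_preference_loss(logp_pos, logp_neg, beta=0.5)
loss.backward()  # gradients update only the higher-level policy's log-probabilities

In the primitive-in-the-loop setup the preferred/dispreferred pairing would be determined by which subgoal led to higher sparse environment reward when executed by the lower-level policy; that labeling step is omitted here.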
HPO presents a promising solution for addressing key limitations in HRL, demonstrating significant performance improvements in complex robotic control tasks. The proposed bi-level optimization framework and preference-based learning approach effectively tackle non-stationarity and infeasible subgoal generation, paving the way for more efficient and robust HRL algorithms.
This research significantly contributes to the field of hierarchical reinforcement learning by addressing critical limitations that hinder its practical application in complex robotic tasks. The proposed HPO method offers a novel and effective solution for improving the stability and efficiency of HRL, potentially enabling the development of more sophisticated and autonomous robotic systems.
While HPO demonstrates promising results, future research could explore its application to a wider range of robotic tasks and environments with varying complexity. Additionally, investigating the impact of different preference elicitation methods and exploring alternative reward shaping techniques could further enhance HPO's performance and generalizability.