A Dynamical Clipping Approach with Task Feedback for Proximal Policy Optimization
The author proposes Pb-PPO, a novel approach that uses a multi-armed bandit algorithm to dynamically adjust the clipping bound during proximal policy optimization, so that training aligns with task preferences or human feedback.
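To make the idea concrete, the following is a minimal sketch of how a bandit could choose a clipping bound for the PPO clipped surrogate. It is not the author's implementation: the summary does not say which bandit rule is used, so UCB1 is assumed here, and the names `ClipBoundBandit`, `clipped_surrogate`, `run_demo`, the candidate bounds, and the placeholder feedback signal are all hypothetical.

```python
import math


class ClipBoundBandit:
    """UCB1 bandit whose arms are candidate PPO clipping bounds (hypothetical helper)."""

    def __init__(self, candidates, exploration=2.0):
        self.candidates = list(candidates)
        self.exploration = exploration
        self.counts = [0] * len(self.candidates)
        self.means = [0.0] * len(self.candidates)

    def select(self):
        # Try every arm once before applying the UCB rule.
        for arm, count in enumerate(self.counts):
            if count == 0:
                return arm
        total = sum(self.counts)
        scores = [
            mean + math.sqrt(self.exploration * math.log(total) / count)
            for mean, count in zip(self.means, self.counts)
        ]
        return max(range(len(scores)), key=scores.__getitem__)

    def update(self, arm, reward):
        # Incremental running mean of the feedback observed for this arm.
        self.counts[arm] += 1
        self.means[arm] += (reward - self.means[arm]) / self.counts[arm]


def clipped_surrogate(ratio, advantage, eps):
    """Standard PPO clipped objective for one sample, with clipping bound eps."""
    clipped = max(1.0 - eps, min(1.0 + eps, ratio))
    return min(ratio * advantage, clipped * advantage)


def run_demo(rounds=200):
    """Toy loop: pick a bound, observe feedback, update the bandit."""
    bandit = ClipBoundBandit([0.1, 0.2, 0.3])  # assumed candidate bounds
    for _ in range(rounds):
        arm = bandit.select()
        eps = bandit.candidates[arm]
        # Placeholder feedback (assumption): a real run would update the policy
        # with `eps` and score it by task return or a human preference signal.
        reward = 1.0 if eps == 0.2 else 0.0
        bandit.update(arm, reward)
    return bandit
```

In this toy loop the feedback favors the bound 0.2, so the bandit ends up selecting that arm most often while still occasionally exploring the others. In an actual training run, the chosen `eps` would be passed to the surrogate loss for that policy update, and the reward would come from the task or from human feedback.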