Key idea
ORSO is a novel approach that accelerates reward design in reinforcement learning by framing it as an online model selection problem, efficiently identifying effective shaping reward functions without human intervention.
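The online model selection framing can be illustrated with a minimal bandit-style sketch: each candidate shaping reward is treated as an arm, and the training budget is allocated to the candidates that yield the best evaluated policies. This is a simplified illustration, not ORSO's actual algorithm; `evaluate`, the UCB selector, and the simulated candidate qualities are all hypothetical stand-ins.

```python
import math
import random

def ucb_select(counts, values, t, c=2.0):
    """Pick the arm (candidate reward function) with the highest UCB score."""
    for i, n in enumerate(counts):
        if n == 0:
            return i  # try each candidate at least once
    return max(range(len(counts)),
               key=lambda i: values[i] + c * math.sqrt(math.log(t) / counts[i]))

def online_reward_selection(evaluate, num_candidates, budget):
    """Allocate a training budget across candidate shaping rewards online.

    `evaluate(i)` stands in for running a short training increment with
    candidate reward i and returning the resulting task performance.
    """
    counts = [0] * num_candidates
    values = [0.0] * num_candidates
    for t in range(1, budget + 1):
        i = ucb_select(counts, values, t)
        score = evaluate(i)
        counts[i] += 1
        values[i] += (score - values[i]) / counts[i]  # running mean of returns
    return max(range(num_candidates), key=lambda i: values[i])

# Toy demonstration with hypothetical hidden candidate qualities.
random.seed(0)
true_quality = [0.2, 0.8, 0.5]
best = online_reward_selection(
    lambda i: true_quality[i] + random.gauss(0, 0.1), 3, 200)
print(best)
```

The key property this sketch shares with the framing above is that poor candidate rewards receive progressively less of the training budget, which is what makes selection faster than training every candidate to completion.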
Statistics
ORSO achieves human-level performance more than twice as fast as the naive selection strategy.
ORSO consistently matches or exceeds human-designed rewards, particularly in more complex environments.
Given a budget of at least 10 times the iterations used to train with the human-engineered reward, ORSO surpasses human-designed rewards.
Quotes
"ORSO significantly improves sample efficiency, reduces computational time, and consistently identifies high-quality reward functions that produce policies comparable to those generated by domain experts through hand-engineered rewards."
"Our empirical results across various continuous control tasks using the Isaac Gym simulator demonstrate that ORSO identifies the best auxiliary reward function much faster (2× or more) than current methods."
"Moreover, ORSO consistently selects reward functions that are comparable to, and sometimes surpass, those designed by domain experts."