Key concepts
A novel approach, DrS (Dense reward learning from Stages), for learning reusable dense rewards for multi-stage robotic manipulation tasks in a data-driven manner, effectively reducing human effort in reward engineering.
Summary
The paper proposes a novel approach, DrS (Dense reward learning from Stages), for learning reusable dense rewards for multi-stage robotic manipulation tasks. The key insights are:
Leveraging the stage structure of the task, DrS learns a high-quality dense reward from sparse rewards and demonstrations (if given). The learned rewards can be reused in unseen tasks, reducing the human effort for reward engineering.
In single-stage tasks, DrS trains a discriminator to classify success and failure trajectories, using the sparse reward signal as supervision. This ensures the discriminator continues to learn meaningful information even at convergence, unlike previous adversarial imitation learning methods.
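The single-stage idea can be sketched as a logistic discriminator whose training labels come from the sparse reward rather than from a demo-vs-agent split. This is a minimal illustrative sketch, not the paper's implementation: the feature dimension, data, and the use of the raw sigmoid output as the dense reward are assumptions.

```python
import numpy as np

# Sketch: a logistic discriminator over state features, trained to separate
# success transitions (sparse reward = 1) from failure transitions
# (sparse reward = 0). Its output probability then serves as a dense reward.

rng = np.random.default_rng(0)
dim = 4
w = np.zeros(dim)  # discriminator weights (a linear model for simplicity)

def discriminator(states):
    """Probability that each state belongs to a success trajectory."""
    return 1.0 / (1.0 + np.exp(-states @ w))

def train_step(states, sparse_rewards, lr=0.1):
    """One binary cross-entropy gradient step; labels ARE the sparse rewards,
    so the discriminator keeps receiving meaningful supervision even when
    the policy improves (unlike demo-vs-agent adversarial labels)."""
    global w
    p = discriminator(states)
    grad = states.T @ (p - sparse_rewards) / len(states)
    w -= lr * grad

def dense_reward(states):
    # Dense reward = discriminator's success probability.
    return discriminator(states)

# Toy data: "success" states cluster at +1, "failure" states at -1.
succ = rng.normal(1.0, 0.3, size=(64, dim))
fail = rng.normal(-1.0, 0.3, size=(64, dim))
states = np.vstack([succ, fail])
labels = np.concatenate([np.ones(64), np.zeros(64)])
for _ in range(200):
    train_step(states, labels)
```

After training, `dense_reward` assigns high values to success-like states and low values to failure-like ones, giving the policy a smooth learning signal in place of the binary sparse reward.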
For multi-stage tasks, DrS trains a separate discriminator for each stage, where the discriminator for stage k aims to distinguish trajectories that reach beyond stage k from those that only reach up to stage k. The stage-specific discriminators are then combined to form the final dense reward.
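One plausible way to combine the per-stage discriminators, sketched below under assumptions: the function names (`stage_of`, `discriminators`) and the fixed per-stage bonus are illustrative, not the paper's exact formula. The bonus is chosen larger than the bounded discriminator term so that reaching a later stage always yields a higher reward than any score within an earlier stage.

```python
import math

def combined_reward(state, stage_of, discriminators, stage_bonus=2.0):
    """Illustrative combined dense reward for a multi-stage task.

    stage_of(state)        -> index k of the furthest stage reached (assumed helper)
    discriminators[k](state) -> raw score of the stage-k discriminator
    Since tanh is bounded in (-1, 1) and stage_bonus = 2, the reward is
    strictly increasing across stage boundaries.
    """
    k = stage_of(state)
    d = discriminators[k](state)
    return stage_bonus * k + math.tanh(d)
```

For example, any state in stage 1 receives at least `2 - 1 = 1`, which exceeds the maximum possible stage-0 reward of `0 + 1 = 1` whenever the tanh terms are not at their extremes, so the agent is always pushed toward later stages.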
Extensive experiments on 1,000+ task variants from three physical robot manipulation task families (Pick-and-Place, Turn Faucet, Open Cabinet Door) demonstrate that the learned rewards can be reused in unseen tasks, resulting in improved performance and sample efficiency of RL algorithms compared to using sparse rewards. In certain tasks, the learned rewards even achieve comparable performance to human-engineered reward functions.
The proposed approach significantly reduces the human effort required for reward engineering. For example, while the human-engineered reward for "Open Cabinet Door" involves over 100 lines of code, 10 candidate terms, and many hand-tuned "magic" parameters, DrS requires only two boolean functions as stage indicators.
Success conditions
Pick-and-Place: the object is in close proximity to the goal position, and both the robot arm and the object remain stationary.
Turn Faucet: the handle reaches a target angle.
Open Cabinet Door: the door is opened to a sufficient degree and remains stationary.
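The stage indicators DrS needs are just boolean predicates over such conditions. The sketch below illustrates two hypothetical indicators for "Open Cabinet Door"; the observation keys and thresholds are assumptions for the example, not values from the paper.

```python
# Illustrative stage indicators for "Open Cabinet Door". The obs keys
# ("is_grasping_handle", "door_angle", "door_vel") and the thresholds are
# assumed; the point is that DrS needs only boolean predicates like these,
# not a hand-shaped dense reward.

def handle_grasped(obs):
    """Stage-1 indicator: the gripper is holding the door handle."""
    return bool(obs["is_grasping_handle"])

def door_opened(obs, open_angle=0.8, vel_eps=1e-2):
    """Stage-2 / success indicator: door opened sufficiently and stationary."""
    return obs["door_angle"] >= open_angle and abs(obs["door_vel"]) < vel_eps
```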
Quotations
"The success of many reinforcement learning (RL) techniques heavily relies on dense reward functions, which are often tricky to design by humans due to heavy domain expertise requirements and tedious trials and errors."
"Ideally, the learned reward will be reused to efficiently solve new tasks that share similar success conditions with the task used to learn the reward."
"Our approach involves incorporating sparse rewards as a supervision signal in lieu of the original signal used for classifying demonstration and agent trajectories."