Core Concepts
A sample-efficient method that learns a robust reward function from a small set of ranked, suboptimal demonstrations consisting of partial-view point cloud observations, and then learns a policy by optimizing the learned reward function with reinforcement learning.
Abstract
The authors propose a novel preference-based reinforcement learning approach to address the challenges of partial observability and suboptimal demonstrations in automating surgical tasks, particularly electrocautery.
Key highlights:
- They use a point cloud autoencoder to learn a low-dimensional feature representation of the partial-view point cloud observations.
- They leverage Trajectory-ranked Reward Extrapolation (T-REX) to learn a reward function that explains the pairwise preferences over suboptimal demonstrations.
- They then use reinforcement learning to optimize the learned reward function and obtain a robust policy.
- Experiments in simulation show that their method outperforms pure imitation learning, achieving an 80% task success rate.
- They also demonstrate a proof-of-concept physical experiment on a real surgical robot, succeeding in 5 of 7 trials.
- The approach reduces the need for near-optimal demonstrations and enables surgical policy learning from qualitative human evaluations.
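The core of the T-REX step above is a Bradley-Terry preference loss: given a pair of trajectories where one is ranked above the other, the learned reward should assign a higher cumulative return to the preferred one. A minimal plain-Python sketch, assuming a linear reward model over per-step features (the paper uses a neural network over point cloud embeddings; `reward_net` and the trajectory format here are illustrative stand-ins):

```python
import math

def reward_net(feat, w):
    # Tiny linear stand-in for the learned reward model: r(s) = w . phi(s)
    return sum(wi * fi for wi, fi in zip(w, feat))

def trex_loss(w, traj_low, traj_high):
    """T-REX pairwise preference loss: -log P(traj_high preferred over traj_low),
    where P = exp(R_high) / (exp(R_low) + exp(R_high)) and R is the summed
    per-step reward over each trajectory (a list of feature vectors)."""
    r_low = sum(reward_net(s, w) for s in traj_low)
    r_high = sum(reward_net(s, w) for s in traj_high)
    m = max(r_low, r_high)  # log-sum-exp shift for numerical stability
    return -(r_high - m) + math.log(math.exp(r_low - m) + math.exp(r_high - m))
```

When the preferred trajectory already scores higher, the loss drops below log 2 (the chance level for a pair); minimizing it over many ranked pairs extrapolates a reward consistent with the demonstrator's intent.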
Stats
The reward function is defined as:
R(eef, B) = max_b∈B 1 / (||eef - b||^2 + ε)
where eef is the 3D end-effector position, b is the 3D position of an attachment point, B is the set of attachment points, and ε is a small number.
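The stated reward can be implemented directly: it is the inverse squared distance from the end-effector to the nearest attachment point, peaking as the end-effector approaches any point in B. A minimal sketch in plain Python (the function name and the default ε value are my choices, not the paper's):

```python
def reward(eef, attachment_points, eps=1e-6):
    """R(eef, B) = max over b in B of 1 / (||eef - b||^2 + eps).
    eef: 3D end-effector position; attachment_points: list of 3D points;
    eps: small constant preventing division by zero (value assumed here)."""
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return max(1.0 / (sq_dist(eef, b) + eps) for b in attachment_points)
```

Because of the max over B, only the closest attachment point matters, so the reward does not penalize moving away from the other candidates.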
Quotes
"Automating robotic surgery via learning from demonstration (LfD) techniques is extremely challenging. This is because surgical tasks often involve sequential decision-making processes with complex interactions of physical objects and have low tolerance for mistakes."
"Our proposed approach uses pairwise preference labels over suboptimal trajectory data to capture the demonstrator's intent in the form of a learned reward function that can be optimized via reinforcement learning to yield a robust robot policy."