Key Concepts
The authors propose a novel theoretical framework for preference-based reinforcement learning (PbRL) that decouples interaction with the environment from the collection of human feedback. This decoupling enables efficient learning of the optimal policy under linear reward parametrization and unknown transitions.
Summary
The paper presents a new theoretical approach for preference-based reinforcement learning (PbRL) that addresses the gap between existing theoretical work and practical algorithms. The key ideas are:
Reward-Agnostic Exploration: The algorithm first collects exploratory state-action trajectories from the environment without any human feedback. This exploratory data can then be reused to learn different reward functions.
Decoupling Interaction and Feedback: Unlike existing approaches that query for human feedback at every iteration, the algorithm separates collecting exploratory data from obtaining human feedback. This simplifies practical implementation and reduces the sample complexity of human feedback (see the first sketch after this list).
Theoretical Guarantees: The authors provide sample complexity bounds for their algorithm, showing that it requires less human feedback to learn the optimal policy compared to prior theoretical work, especially when the transitions are unknown but have a linear or low-rank structure.
Action-Based Comparison: The authors also investigate a variant of their algorithm that handles action-based comparison feedback, where the human provides preferences over individual actions at a given state rather than over full trajectories. This setting can yield better sample complexity when the advantage function of the optimal policy is bounded (see the second sketch below).
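The following is a minimal sketch of the decoupled pipeline described above, assuming a linear reward r(tau) = <theta, phi(tau)> over trajectory features and a Bradley-Terry (logistic) preference model. The names (`PHI_DIM`, `collect_exploratory_features`, `query_preference`) and the Gaussian stand-in for the exploration data are illustrative assumptions, not the paper's construction.

```python
import numpy as np

rng = np.random.default_rng(0)
PHI_DIM = 4  # dimension of the trajectory feature map phi(tau)

# ---- Phase 1: reward-agnostic exploration (no human in the loop) ----------
# Stand-in for the exploration routine: each trajectory is summarized by its
# feature vector phi(tau). The real algorithm collects these trajectories so
# as to cover the state-action space under the unknown transitions.
def collect_exploratory_features(n_traj):
    return rng.normal(size=(n_traj, PHI_DIM))

features = collect_exploratory_features(500)  # reusable for any reward later

# ---- Phase 2: a single batch of human feedback on trajectory pairs --------
theta_true = rng.normal(size=PHI_DIM)  # the annotator's reward (simulated)

def query_preference(phi_a, phi_b):
    # Bradley-Terry: P(a preferred to b) = sigmoid(<theta, phi_a - phi_b>)
    p = 1.0 / (1.0 + np.exp(-(phi_a - phi_b) @ theta_true))
    return float(rng.random() < p)

pairs = rng.choice(len(features), size=(200, 2))
diffs = features[pairs[:, 0]] - features[pairs[:, 1]]
labels = np.array([query_preference(features[i], features[j]) for i, j in pairs])

# ---- Phase 3: fit the linear reward by logistic (Bradley-Terry) MLE -------
theta_hat = np.zeros(PHI_DIM)
for _ in range(2000):  # plain gradient ascent on the log-likelihood
    probs = 1.0 / (1.0 + np.exp(-diffs @ theta_hat))
    theta_hat += 0.5 * diffs.T @ (labels - probs) / len(labels)

cos = theta_hat @ theta_true / (np.linalg.norm(theta_hat) * np.linalg.norm(theta_true))
print(f"cosine(theta_hat, theta_true) = {cos:.3f}")
# A planner would now optimize a policy against <theta_hat, phi(tau)> using
# the SAME exploratory data: no further environment interaction is needed.
```

If the reward of interest changes, only Phases 2 and 3 are rerun on the stored trajectories, which is the practical payoff of decoupling exploration from feedback.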
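Along the same lines, here is a minimal sketch of the action-comparison variant, under the illustrative assumption of a linear advantage <w, psi(s, a)> over hypothetical state-action features `psi`; the annotator again follows a Bradley-Terry model, now on the advantage gap between two actions at the same state.

```python
import numpy as np

rng = np.random.default_rng(1)
S_DIM, N_ACTIONS = 3, 4
D = S_DIM * N_ACTIONS

def psi(s, a):
    # Hypothetical feature map: one-hot(action) outer state features.
    v = np.zeros(D)
    v[a * S_DIM:(a + 1) * S_DIM] = s
    return v

w_true = rng.normal(size=D)  # unknown linear advantage parameters (simulated)

# Collect action-comparison feedback: at each sampled state, the annotator
# compares two candidate actions, with Bradley-Terry on the advantage gap.
states = rng.normal(size=(300, S_DIM))
diffs, labels = [], []
for s in states:
    a, b = rng.choice(N_ACTIONS, size=2, replace=False)
    d = psi(s, a) - psi(s, b)
    p = 1.0 / (1.0 + np.exp(-d @ w_true))
    diffs.append(d)
    labels.append(rng.random() < p)
diffs, labels = np.array(diffs), np.array(labels, dtype=float)

# Logistic MLE, as before, then act greedily w.r.t. the learned scores.
w_hat = np.zeros(D)
for _ in range(2000):
    probs = 1.0 / (1.0 + np.exp(-diffs @ w_hat))
    w_hat += 0.5 * diffs.T @ (labels - probs) / len(labels)

test = rng.normal(size=(1000, S_DIM))
greedy = lambda w, s: int(np.argmax([psi(s, a) @ w for a in range(N_ACTIONS)]))
agree = np.mean([greedy(w_hat, s) == greedy(w_true, s) for s in test])
print(f"greedy-action agreement with the true advantage: {agree:.2%}")
```

Because each query concerns two actions at a single state rather than two full trajectories, each label is cheaper for the annotator; per the paper's claim above, fewer labels can suffice when the optimal policy's advantage function is bounded.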
The paper demonstrates how careful algorithmic design, combining reward-agnostic exploration with the decoupling of data collection from human feedback, can bridge the gap between theoretical PbRL and practical applications.