Key Concepts
This paper presents the first globally convergent online RLHF algorithm with neural network parameterization. It addresses the distribution shift problem and provides theoretical convergence guarantees with state-of-the-art sample complexity.
Statistics
The achieved sample complexity is $\epsilon^{-7/2}$.
The current state-of-the-art sample complexity for vanilla actor-critic with neural network parameterization is $\epsilon^{-3}$.
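For a rough sense of scale, the two rates can be compared directly. This sketch assumes that constants and logarithmic factors, which the bounds above do not state, can be ignored, and uses $\epsilon = 10^{-2}$ purely as an illustrative target accuracy:

$$
\frac{\epsilon^{-7/2}}{\epsilon^{-3}} = \epsilon^{-1/2}, \qquad \epsilon = 10^{-2}: \quad \epsilon^{-7/2} = 10^{7}, \quad \epsilon^{-3} = 10^{6}, \quad \epsilon^{-1/2} = 10.
$$

Under these assumptions, the RLHF bound exceeds the vanilla actor-critic bound by a factor of $\epsilon^{-1/2}$. Note that the two rates refer to different settings (online RLHF versus vanilla actor-critic).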