Core Concept
This research paper presents the first globally convergent online RLHF algorithm with neural network parameterization, addressing the distribution shift issue and providing theoretical convergence guarantees with state-of-the-art sample complexity.
Statistics
The achieved sample complexity is ε^{-7/2}.
The current state-of-the-art sample complexity for vanilla actor-critic with neural parameterization is ε^{-3}.
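
To get a feel for what these two rates mean numerically, the short sketch below evaluates only the leading-order terms ε^{-7/2} and ε^{-3} for a couple of target accuracies. It is an illustration under stated assumptions, not a result from the paper: the helper name sample_order is hypothetical, and constants and logarithmic factors (which the summary does not give) are ignored, so the numbers indicate scaling only. The two rates also refer to different problem settings (online RLHF versus vanilla actor-critic).

```python
def sample_order(eps: float, exponent: float) -> float:
    """Leading-order term eps**(-exponent) of a sample-complexity bound.

    Hypothetical helper for illustration; constants and log factors are omitted.
    """
    if eps <= 0:
        raise ValueError("eps must be positive")
    return eps ** (-exponent)


if __name__ == "__main__":
    for eps in (0.1, 0.01):
        rlhf = sample_order(eps, 3.5)          # eps^(-7/2), the rate stated for online RLHF
        actor_critic = sample_order(eps, 3.0)  # eps^(-3), the rate stated for vanilla actor-critic
        # The ratio of the two leading terms is eps^(-1/2).
        print(f"eps={eps}: eps^(-7/2) ~ {rlhf:.3g}, "
              f"eps^(-3) ~ {actor_critic:.3g}, ratio ~ {rlhf / actor_critic:.3g}")
```

The ratio of the two leading terms is ε^{-1/2}, so the gap between them grows slowly as the target accuracy ε shrinks.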