Core Concepts
This paper presents AFU (Actor-Free Updates), a novel off-policy reinforcement learning algorithm that solves the challenging "max-Q problem" in Q-learning for continuous action spaces (estimating max_a Q(s, a) when the maximization over actions has no closed form) using regression and conditional gradient scaling. AFU has an actor, but its critic updates are entirely independent of it, allowing the actor to be chosen freely.
Abstract
The paper introduces a new off-policy reinforcement learning algorithm called AFU (Actor-Free Updates) that addresses the "max-Q problem" in Q-learning for continuous action spaces.
The key highlights are:
AFU has a critic (Q-function) and an actor, but the critic updates are entirely independent of the actor, unlike in state-of-the-art actor-critic methods.
The critic updates are derived from a novel adaptation of Q-learning to continuous action spaces, using regression and conditional gradient scaling to solve the max-Q problem (see the first sketch after these highlights).
In the initial version, AFU-alpha, the actor is trained with the same stochastic, entropy-regularized objective as in Soft Actor-Critic (SAC) (second sketch below).
The authors then study a simple failure mode of SAC and propose a modified version, AFU-beta, which uses the value function trained by regression to guide the actor updates and make them less prone to getting trapped in local optima (third sketch below).
Experimental results on a benchmark of 7 MuJoCo tasks show that both AFU-alpha and AFU-beta are competitive in sample efficiency with state-of-the-art actor-critic methods such as TD3 and SAC, while departing from the actor-critic perspective.
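To make the critic update concrete, here is a minimal PyTorch sketch of one way regression with conditional gradient scaling can estimate max_a Q(s, a) without involving an actor. The decomposition into a state value V and a non-negative gap A, the ReLU constraint, the network sizes, and the coefficient rho are assumptions for illustration; the paper's exact losses and architectures may differ.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of an actor-free critic update. The decomposition
# Q(s, a) = V(s) - A(s, a) with A >= 0, the network sizes, and the
# coefficient rho are illustrative assumptions, not the paper's exact design.
obs_dim, act_dim = 8, 2

def mlp(in_dim, out_dim):
    return nn.Sequential(nn.Linear(in_dim, 64), nn.ReLU(), nn.Linear(64, out_dim))

V = mlp(obs_dim, 1)            # state value, meant to track max_a Q(s, a)
A = mlp(obs_dim + act_dim, 1)  # non-negative gap, so Q(s, a) = V(s) - A(s, a)
opt = torch.optim.Adam(list(V.parameters()) + list(A.parameters()), lr=3e-4)

def critic_loss(s, a, q_target, rho=0.2):
    v = V(s)
    adv = torch.relu(A(torch.cat([s, a], dim=-1)))  # enforce A >= 0
    below = (q_target < v).float()                  # targets under the bound V(s)
    # Conditional gradient scaling: on "below" samples, only a fraction rho of
    # the gradient reaches V (the detach trick leaves the value unchanged), so
    # A absorbs the residual and V settles near the max of the targets rather
    # than their mean. Above the bound, the full gradient pushes V upward.
    v_scaled = below * (rho * v + (1 - rho) * v.detach()) + (1 - below) * v
    return ((v_scaled - adv - q_target) ** 2).mean()

# One illustrative update on random data; note that no actor is involved.
s, a = torch.randn(32, obs_dim), torch.randn(32, act_dim)
tgt = torch.randn(32, 1)
opt.zero_grad(); critic_loss(s, a, tgt).backward(); opt.step()
```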
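AFU-alpha's actor update follows SAC's standard entropy-regularized objective, condensed below; the layer sizes, the fixed temperature alpha, and the placeholder critic are illustrative.

```python
import torch
import torch.nn as nn

# Condensed SAC-style actor: a squashed Gaussian policy trained to maximize
# Q(s, a) plus an entropy bonus. Sizes and temperature alpha are placeholders.
obs_dim, act_dim, alpha = 8, 2, 0.2

class GaussianActor(nn.Module):
    def __init__(self):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(obs_dim, 64), nn.ReLU())
        self.mu = nn.Linear(64, act_dim)
        self.log_std = nn.Linear(64, act_dim)

    def forward(self, s):
        h = self.body(s)
        dist = torch.distributions.Normal(self.mu(h), self.log_std(h).clamp(-5, 2).exp())
        u = dist.rsample()                      # reparameterized sample
        a = torch.tanh(u)                       # squash into the action bounds
        # Log-probability with the tanh change-of-variables correction.
        logp = dist.log_prob(u).sum(-1) - torch.log(1 - a.pow(2) + 1e-6).sum(-1)
        return a, logp

actor = GaussianActor()
opt = torch.optim.Adam(actor.parameters(), lr=3e-4)

def actor_loss(s, q_fn):
    a, logp = actor(s)
    # Maximizing Q - alpha * log pi is minimizing alpha * log pi - Q.
    return (alpha * logp - q_fn(s, a)).mean()

# Illustrative step with a stand-in critic (not AFU's regression-trained one).
q_fn = lambda s, a: -((a - 0.5) ** 2).sum(-1)
opt.zero_grad(); actor_loss(torch.randn(32, obs_dim), q_fn).backward(); opt.step()
```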
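The summary does not spell out AFU-beta's exact update, so the sketch below shows only one plausible way a regression-trained V(s) ≈ max_a Q(s, a) could guide the actor: reweighting the update on states where the actor's action falls far short of V. The gating rule, gap_scale, and the stand-in functions are hypothetical, not the rule derived in the paper.

```python
import torch

def beta_actor_loss(s, actor, q_fn, v_fn, alpha=0.2, gap_scale=1.0):
    """One hypothetical use of V against local optima; not the paper's rule.

    Since V(s) is regressed toward max_a Q(s, a), the detached gap
    V(s) - Q(s, a_pi) estimates the actor's per-state suboptimality.
    """
    a, logp = actor(s)                            # actor returns (action, log-prob)
    q = q_fn(s, a)
    gap = (v_fn(s) - q).detach().clamp(min=0.0)   # estimated suboptimality
    weight = 1.0 + gap_scale * gap                # push harder where the actor lags
    return (alpha * logp - weight * q).mean()

# Illustrative call with stand-ins; the GaussianActor above fits this interface.
actor = lambda s: (torch.tanh(torch.randn(s.shape[0], 2)), torch.zeros(s.shape[0]))
q_fn = lambda s, a: -((a - 0.5) ** 2).sum(-1)
v_fn = lambda s: torch.zeros(s.shape[0])
print(beta_actor_loss(torch.randn(32, 8), actor, q_fn, v_fn))
```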
The authors believe that AFU could open up new avenues for off-policy reinforcement learning algorithms applied to continuous control problems.