Vlearn: Efficient Off-Policy Learning without State-Action-Value Function
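Vlearn is an off-policy trust region optimization method that learns only a state-value function V(s), removing the need for an explicit state-action-value function Q(s, a). Because V depends on the state alone, the method avoids fitting a function over the joint state-action space, which becomes harder as the action dimension grows. The authors report improved performance and stability in high-dimensional action spaces.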
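To make the idea concrete, here is a minimal tabular sketch of learning V off-policy from replay data collected by a behavior policy b. It rests on three assumptions. First, a truncated importance weight min(pi(a|s)/b(a|s), clip) scales the whole squared TD error. Second, the advantage is estimated from V alone as the one-step TD error r + gamma*V(s') - V(s). Third, every name, the clipping level, and the transition layout are hypothetical. This is not Vlearn's reference implementation, and the trust region constraint on the policy update is omitted.

```python
from typing import Dict, List, Tuple

# Hypothetical transition layout: (state, reward, next_state, done, pi_prob, behavior_prob),
# where pi_prob = pi(a|s) under the current policy and behavior_prob = b(a|s) at collection time.
Transition = Tuple[int, float, int, bool, float, float]


def truncated_ratio(pi_prob: float, behavior_prob: float, clip: float = 1.0) -> float:
    # Importance weight pi(a|s) / b(a|s), truncated at `clip` (assumed value).
    return min(pi_prob / behavior_prob, clip)


def td_target(v: Dict[int, float], r: float, s_next: int, done: bool, gamma: float) -> float:
    # One-step bootstrapped target r + gamma * V(s'); no bootstrap past a terminal step.
    return r + (0.0 if done else gamma * v[s_next])


def v_loss(v: Dict[int, float], batch: List[Transition], gamma: float = 0.99, clip: float = 1.0) -> float:
    # Mean importance-weighted squared TD error; V is the only learned function (no Q).
    total = 0.0
    for s, r, s_next, done, pi_p, b_p in batch:
        w = truncated_ratio(pi_p, b_p, clip)
        total += w * (td_target(v, r, s_next, done, gamma) - v[s]) ** 2
    return total / len(batch)


def v_update(v: Dict[int, float], batch: List[Transition], gamma: float = 0.99,
             lr: float = 0.1, clip: float = 1.0) -> None:
    # One pass of semi-gradient descent on v_loss, holding the TD target fixed.
    for s, r, s_next, done, pi_p, b_p in batch:
        w = truncated_ratio(pi_p, b_p, clip)
        delta = td_target(v, r, s_next, done, gamma) - v[s]
        # d/dV(s) of w * delta**2 is -2 * w * delta; the factor 2 is absorbed into lr.
        v[s] += lr * w * delta


def surrogate_objective(v: Dict[int, float], batch: List[Transition], gamma: float = 0.99) -> float:
    # Off-policy policy objective: mean of pi/b * A(s, a), with A estimated from V only.
    # A real method would maximize this over policy parameters inside a trust region.
    total = 0.0
    for s, r, s_next, done, pi_p, b_p in batch:
        adv = td_target(v, r, s_next, done, gamma) - v[s]
        total += (pi_p / b_p) * adv
    return total / len(batch)


# Toy two-state chain: state 0 -> state 1 (reward 0), then state 1 ends with reward 1.
batch: List[Transition] = [
    (0, 0.0, 1, False, 0.5, 1.0),  # pi/b = 0.5
    (1, 1.0, 1, True, 0.8, 0.4),   # pi/b = 2.0, truncated to 1.0 in the V loss
]
initial_surrogate = surrogate_objective({0: 0.0, 1: 0.0}, batch, gamma=0.9)
v = {0: 0.0, 1: 0.0}
for _ in range(200):
    v_update(v, batch, gamma=0.9, lr=0.5)
print(round(v[0], 4), round(v[1], 4))  # 0.9 1.0
```

With gamma = 0.9 the fixed point is V(1) = 1 and V(0) = 0.9. The importance weights change only the step size per transition, not where the values settle. Once V fits the data, the V-based advantages, and hence the surrogate, shrink toward zero.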