Entropy-Regularized Token-Level Policy Optimization for Large Language Models
The authors introduce Entropy-Regularized Token-level Policy Optimization (ETPO), a method for optimizing Large Language Models at the token level that addresses challenges in RL fine-tuning. ETPO decomposes each generated action into its constituent tokens and assigns credit per token, yielding a finer-grained learning signal than sequence-level optimization and improving downstream performance.
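To make the idea of token-level credit assignment under entropy regularization concrete, here is a minimal toy sketch. It propagates a sequence-level reward back to individual token steps using a soft (entropy-regularized) Bellman backup, where the state value is the temperature-scaled log-sum-exp of per-token Q-values. The function names, the scalar Q representation, and the undiscounted backup are illustrative assumptions, not the paper's actual implementation.

```python
import math


def soft_value(q_values, tau):
    """Entropy-regularized (soft) state value: tau * log sum exp(Q / tau).

    Computed with the max-subtraction trick for numerical stability.
    As tau -> 0 this approaches max(q_values); larger tau rewards
    keeping entropy in the token distribution.
    """
    m = max(q_values)
    return m + tau * math.log(sum(math.exp((q - m) / tau) for q in q_values))


def token_level_targets(q_per_step, rewards, tau=0.1):
    """Per-token Bellman targets for one generated sequence.

    q_per_step[t] holds Q-values over candidate tokens at step t;
    rewards[t] is the reward observed after emitting token t (often
    zero everywhere except the final step). The target for token t is
    r_t plus the soft value of the next step, so terminal reward is
    distributed backwards token by token instead of being attributed
    to the whole sequence at once.
    """
    num_steps = len(rewards)
    targets = []
    for t in range(num_steps):
        if t == num_steps - 1:
            targets.append(rewards[t])  # no successor after the last token
        else:
            targets.append(rewards[t] + soft_value(q_per_step[t + 1], tau))
    return targets


# Toy sequence of three token steps with a terminal reward of 1.0.
qs = [[1.0, 0.0], [0.5, 0.2], [0.0, 0.0]]
rewards = [0.0, 0.0, 1.0]
print(token_level_targets(qs, rewards, tau=0.1))
```

With a small temperature the soft value is close to the per-step maximum Q, so the targets approximate a greedy backward pass; raising `tau` softens the backup toward a uniform average, which is the entropy-regularization knob.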