The author proposes an improved algorithm for adversarial linear mixture MDPs, focusing on unknown transitions and bandit feedback, achieving a regret bound that surpasses previous results.
Prioritized League Reinforcement Learning addresses challenges in large-scale heterogeneous multiagent systems by promoting cooperation and resolving sample inequality.
Efficient exploration in low-rank MDPs via the VoX algorithm.
Adversarial bandit algorithms whose policies change slowly can efficiently handle discounted Markov decision processes.
AdaMemento is a novel reinforcement learning framework that leverages past experiences and fine-grained exploration to improve policy optimization, particularly in sparse reward environments.
This paper introduces Matryoshka Policy Gradient (MPG), a novel policy gradient algorithm for fixed-horizon max-entropy reinforcement learning, and proves its global convergence to the optimal policy in the function approximation setting with log-linear policies.
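The summary above is theory-level, but the ingredients (a log-linear, i.e. softmax-over-features, policy and a max-entropy objective) are standard. Below is a minimal generic sketch, not the paper's MPG algorithm: exact entropy-regularized policy gradient ascent with a log-linear policy on a toy one-step problem. The feature matrix, rewards, temperature, and step size are all illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
n_actions, d = 4, 3
phi = rng.normal(size=(n_actions, d))   # action features (illustrative)
r = np.array([1.0, 0.2, 0.0, -0.5])     # per-action rewards (illustrative)
tau = 0.1                               # entropy temperature
theta = np.zeros(d)

def policy(theta):
    """Log-linear (softmax) policy: p(a) ∝ exp(phi(a) · theta)."""
    logits = phi @ theta
    logits -= logits.max()              # numerical stability
    p = np.exp(logits)
    return p / p.sum()

def objective(theta):
    """Max-entropy objective: E_p[r] + tau * H(p)."""
    p = policy(theta)
    return p @ r - tau * (p * np.log(p + 1e-12)).sum()

for _ in range(2000):
    p = policy(theta)
    # Exact gradient of E_p[r] + tau*H(p) for a softmax log-linear policy:
    # sum_a p(a) * (r(a) - tau*(log p(a) + 1)) * (phi(a) - E_p[phi])
    mean_phi = p @ phi
    adv = r - tau * (np.log(p + 1e-12) + 1.0)
    grad = ((p * adv)[:, None] * (phi - mean_phi)).sum(axis=0)
    theta += 0.1 * grad

p_final = policy(theta)
```

Gradient ascent on this objective increases the entropy-regularized return; in the log-linear setting the paper's contribution is proving that such updates converge globally, not just locally.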
This paper establishes improved gap-dependent regret bounds for UCB-Advantage and Q-EarlySettled-Advantage, two Q-learning algorithms employing variance estimators and reference-advantage decomposition, demonstrating their superior performance under benign MDP structures with a positive suboptimality gap.
High update-to-data ratios in off-policy reinforcement learning often lead to instability due to the value function's inability to generalize to unseen on-policy actions. This issue can be effectively mitigated by incorporating a small amount of model-generated on-policy data into the training process, as demonstrated by the MAD-TD method.
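The mechanism above can be sketched generically: each training batch mixes a small fraction of transitions generated by a learned model under the current policy with real off-policy replay data. This is an illustrative sketch only, not the MAD-TD implementation; the function name, the 5% mixing fraction, and the dummy transition arrays are all hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mixed_batch(replay_buffer, model_rollouts, batch_size=256,
                       model_fraction=0.05):
    """Mix real replay transitions with model-generated on-policy ones.

    replay_buffer:  array of stored off-policy transitions
    model_rollouts: array of transitions generated by a learned model
                    under the current policy (hypothetical source)
    """
    n_model = int(batch_size * model_fraction)  # small on-policy share
    n_real = batch_size - n_model
    real = replay_buffer[rng.integers(0, len(replay_buffer), n_real)]
    model = model_rollouts[rng.integers(0, len(model_rollouts), n_model)]
    return np.concatenate([real, model], axis=0)

# Usage with dummy transition vectors (8 features each):
replay = rng.normal(size=(10_000, 8))
rollouts = rng.normal(size=(500, 8))
batch = sample_mixed_batch(replay, rollouts)
```

The design point is that even a small on-policy share gives the value function training signal on the actions the current policy actually takes, which is precisely where pure off-policy data fails to constrain it at high update-to-data ratios.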