Core Concept
This paper argues that LayerNorm with L2 regularization can stabilize off-policy Temporal Difference (TD) learning, eliminating the need for target networks and replay buffers, and proposes PQN, a simplified deep Q-learning algorithm that leverages parallelized environments for efficient and stable training.
Gallici, M., Fellows, M., Ellis, B., Pou, B., Masmitja, I., Foerster, J. N., & Martin, M. (2024). Simplifying Deep Temporal Difference Learning. arXiv preprint arXiv:2407.04811v2.
This paper investigates how regularization, specifically LayerNorm combined with L2 regularization, can stabilize off-policy Temporal Difference (TD) learning in deep reinforcement learning. Building on this analysis, the authors develop a simplified and efficient deep Q-learning algorithm that removes the target networks and replay buffers commonly used to stabilize TD learning, both of which add complexity and computational overhead.
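To make the mechanism concrete, the sketch below illustrates the core idea under stated assumptions: a Q-network with LayerNorm after each hidden layer, L2 regularization applied via the optimizer's weight decay, and a one-step TD update whose bootstrap target comes from the online network itself, with fresh transitions gathered from parallel (vectorized) environments rather than a replay buffer. This is not the authors' PQN implementation (which is written in JAX and uses multi-step λ-return targets); the environment (CartPole-v1), network width, learning rate, weight decay coefficient, and exploration rate are all illustrative choices, not values from the paper.

```python
import gymnasium as gym
import torch
import torch.nn as nn


class QNetwork(nn.Module):
    """Q-network with LayerNorm after each hidden layer (the stabilizing component)."""

    def __init__(self, obs_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(obs_dim, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.LayerNorm(hidden), nn.ReLU(),
            nn.Linear(hidden, n_actions),
        )

    def forward(self, x):
        return self.net(x)


n_envs, gamma, eps = 8, 0.99, 0.1  # illustrative hyperparameters
envs = gym.vector.SyncVectorEnv([lambda: gym.make("CartPole-v1") for _ in range(n_envs)])
obs_dim = envs.single_observation_space.shape[0]
n_actions = envs.single_action_space.n

q_net = QNetwork(obs_dim, n_actions)
# weight_decay provides the L2 regularization; no separate target network is created.
optimizer = torch.optim.Adam(q_net.parameters(), lr=3e-4, weight_decay=1e-4)

obs, _ = envs.reset(seed=0)
for step in range(1000):
    obs_t = torch.as_tensor(obs, dtype=torch.float32)
    with torch.no_grad():
        q_values = q_net(obs_t)

    # Epsilon-greedy actions taken in all parallel envs; transitions are used
    # immediately (no replay buffer).
    greedy = q_values.argmax(dim=1)
    random_a = torch.randint(n_actions, (n_envs,))
    actions = torch.where(torch.rand(n_envs) < eps, random_a, greedy).numpy()

    next_obs, rewards, terms, truncs, _ = envs.step(actions)

    # One-step TD target built from the *online* network itself -- no frozen copy.
    next_obs_t = torch.as_tensor(next_obs, dtype=torch.float32)
    with torch.no_grad():
        next_q = q_net(next_obs_t).max(dim=1).values
        done = torch.as_tensor(terms, dtype=torch.float32)
        target = torch.as_tensor(rewards, dtype=torch.float32) + gamma * (1.0 - done) * next_q

    q_sa = q_net(obs_t).gather(1, torch.as_tensor(actions).long().unsqueeze(1)).squeeze(1)
    loss = nn.functional.mse_loss(q_sa, target)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    obs = next_obs
```

The design point this sketch is meant to convey is that the stabilizers live inside the network and optimizer (LayerNorm plus weight decay) rather than in the training machinery (target network, replay buffer); the paper's full PQN algorithm additionally replaces the one-step target above with λ-returns computed over the vectorized rollouts.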