Core Concepts
A novel real-time recurrent reinforcement learning (RTRRL) approach that combines a biologically plausible Meta-RL RNN architecture, a TD(λ) actor-critic algorithm, and a random-feedback local-online (RFLO) optimization technique to solve discrete and continuous control tasks in partially observable Markov decision processes (POMDPs).
Abstract
The paper proposes a novel real-time recurrent reinforcement learning (RTRRL) approach that aims to be biologically plausible. RTRRL consists of three key components:
Meta-RL RNN Architecture: A recurrent neural network (RNN) architecture that implements an actor-critic algorithm on its own, inspired by the interplay between the dorsal and ventral striatum in the basal ganglia.
TD(λ) Actor-Critic Algorithm: An actor-critic algorithm that uses temporal-difference (TD) learning with Dutch eligibility traces to train the weights of the Meta-RL network (see the code sketch after this list).
RFLO Optimization: A biologically plausible random-feedback local-online (RFLO) algorithm for computing the gradients of the network parameters, avoiding weight transport and keeping gradient information local (also covered in the sketch below).
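To make the second and third components concrete, here is a minimal sketch, assuming a linear critic that reads the RNN's hidden state and a leaky tanh RNN; it is not the authors' implementation, and the names and hyperparameters (td_lambda_step, rflo_forward, rflo_update, B, tau, the dummy interaction loop) are assumptions made for this example. The critic follows the standard true-online TD(λ) update with Dutch traces; the RFLO-style rule keeps a low-pass-filtered, purely local eligibility per weight and assigns credit through a fixed random feedback vector, so no error is transported back through transposed weights or unrolled through time.

```python
import numpy as np

def td_lambda_step(w, z, v_old, x, x_next, r, gamma=0.99, lam=0.9, alpha=0.01):
    """True-online TD(lambda) for a linear critic, using a Dutch eligibility trace."""
    v, v_next = w @ x, w @ x_next
    delta = r + gamma * v_next - v                                    # TD error
    z = gamma * lam * z + (1.0 - alpha * gamma * lam * (z @ x)) * x   # Dutch trace
    w = w + alpha * (delta + v - v_old) * z - alpha * (v - v_old) * x
    return w, z, v_next, delta                                        # v_next becomes v_old

def rflo_forward(W, P, h, obs, tau=5.0):
    """Advance a leaky tanh RNN one step and update the local eligibility trace P."""
    pre = np.concatenate([h, obs])                 # presynaptic activity [h_{t-1}, x_t]
    u = W @ pre
    phi_prime = 1.0 - np.tanh(u) ** 2              # derivative of tanh nonlinearity
    # low-pass-filtered eligibility: postsynaptic gain times presynaptic activity
    P = (1.0 - 1.0 / tau) * P + (1.0 / tau) * np.outer(phi_prime, pre)
    h_new = (1.0 - 1.0 / tau) * h + (1.0 / tau) * np.tanh(u)
    return P, h_new

def rflo_update(W, P, delta, B, lr=1e-3):
    """Apply credit via a fixed random feedback vector B (no weight transport)."""
    return W + lr * delta * (B[:, None] * P)

# Toy wiring with random observations and rewards, just to show the data flow.
rng = np.random.default_rng(0)
n_h, n_in = 8, 4
w, z, v_old = np.zeros(n_h), np.zeros(n_h), 0.0
W = rng.normal(scale=0.1, size=(n_h, n_h + n_in))  # recurrent + input weights
P = np.zeros_like(W)                                # per-weight eligibility trace
B = rng.normal(size=n_h)                            # fixed random feedback vector
h, obs = np.zeros(n_h), rng.normal(size=n_in)

for _ in range(100):
    obs_next, r = rng.normal(size=n_in), rng.normal()
    P, h_next = rflo_forward(W, P, h, obs)                       # RNN forward pass
    w, z, v_old, delta = td_lambda_step(w, z, v_old, h, h_next, r)  # critic + TD error
    W = rflo_update(W, P, delta, B)                              # local RNN update
    h, obs = h_next, obs_next
```

The point of this sketch is the wiring: the scalar TD error produced by the critic is the single learning signal, updating the critic through its Dutch trace and, broadcast through the fixed random feedback vector, the recurrent weights through their local eligibility traces. That is what keeps every update online and local, in contrast to BPTT, which must unroll the network through time.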
The authors compare RTRRL with popular but biologically implausible RL algorithms that compute gradients with backpropagation through time (BPTT) or real-time recurrent learning (RTRL). The results show that RTRRL with RFLO can outperform BPTT-based methods, especially on tasks that demand exploration in unfavorable environments. Grounded in neuroscience, RTRRL also serves as a model of reward-based learning in the human brain.