Core Concepts
Under the low-rank MDP assumption, contrastive self-supervised learning provably recovers the true transition dynamics, and the learned representations support provably efficient exploration in reinforcement learning.
Abstract
The paper proposes a reinforcement learning algorithm that integrates contrastive self-supervised learning for representation learning. The key highlights are:
The algorithm learns low-dimensional representations of states and actions by minimizing a contrastive loss; under the low-rank MDP assumption, the minimizer is shown to recover the true transition dynamics (a minimal sketch of such a loss follows this list).
The learned representations are then used to construct an upper confidence bound (UCB) bonus, which drives efficient exploration in the online RL setting (see the bonus sketch after this list).
Theoretical analysis shows that the proposed algorithm attains an ε-optimal policy with a sample complexity of Õ(1/ε²) in low-rank MDPs.
The algorithm and theory are further extended to the zero-sum Markov game setting, where the representations are used to construct upper and lower confidence bounds (ULCB) for efficient exploration.
Empirical studies are conducted to demonstrate the efficacy of the UCB-based contrastive learning method for reinforcement learning.
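To make the representation-learning step concrete, here is a minimal sketch of a noise-contrastive loss over a single transition tuple, under the low-rank factorization P(s' | s, a) = ⟨φ(s, a), μ(s')⟩. The function name, feature shapes, and negative-sampling setup are illustrative assumptions, not the paper's exact objective.

import numpy as np

def contrastive_loss(phi_sa, mu_pos, mu_negs):
    """NCE-style contrastive loss for one transition (s, a, s').

    phi_sa  : (d,)    learned feature of the observed (state, action) pair
    mu_pos  : (d,)    feature of the observed next state (positive sample)
    mu_negs : (k, d)  features of k sampled negative next states

    Minimizing this loss pushes <phi(s,a), mu(s')> to be larger for the
    observed next state than for the negatives, so the inner product tracks
    the transition kernel P(s' | s, a) = <phi(s,a), mu(s')>.
    """
    pos_logit = phi_sa @ mu_pos        # score of the positive next state
    neg_logits = mu_negs @ phi_sa      # scores of the k negatives
    logits = np.concatenate(([pos_logit], neg_logits))
    m = logits.max()
    log_z = m + np.log(np.exp(logits - m).sum())   # numerically stable log-partition
    return log_z - pos_logit           # = -log softmax probability of the positive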
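The exploration bonus built from the learned features can be sketched as a standard elliptical UCB term; the scale beta and ridge regularizer lam are hyperparameters assumed here for illustration.

import numpy as np

def ucb_bonus(phi_sa, features_seen, beta=1.0, lam=1.0):
    """Elliptical UCB bonus from learned (state, action) features.

    phi_sa        : (d,)    feature of the (state, action) pair to score
    features_seen : (n, d)  features of previously visited (state, action) pairs

    Returns beta * sqrt(phi^T Sigma^{-1} phi), where Sigma is the ridge-
    regularized empirical covariance of past features; directions rarely
    visited so far receive a large bonus, which induces optimism.
    """
    d = phi_sa.shape[0]
    sigma = lam * np.eye(d) + features_seen.T @ features_seen
    return beta * np.sqrt(phi_sa @ np.linalg.solve(sigma, phi_sa))

Adding such a bonus to estimated rewards yields optimistic value estimates, which is the mechanism behind the sample-complexity guarantee; in the zero-sum Markov-game extension, the analogous quantity is used to form both upper and lower confidence bounds.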
Stats
The paper does not provide any specific numerical data or statistics. The theoretical analysis focuses on establishing sample complexity bounds for the proposed algorithms.