Core Concepts
The defender develops a cost-aware, threat-adaptive defensive strategy against unknown attacks on traffic routing decisions, using a reinforcement learning algorithm based on minimax least-squares policy iteration (LSPI).
Abstract
This work presents a reinforcement learning (RL) algorithm that computes a defensive strategy for a traffic routing system facing potential attacks. The system consists of parallel queues with Poisson arrivals and exponential service times. An attacker can manipulate the optimal routing decisions to increase traffic congestion, while the defender can secure the correct routing decisions at a technological cost.
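The underlying queueing model can be sketched as a discrete-event simulation. The routing rule, arrival rate, and service rates below are illustrative assumptions (join-the-shortest-queue as an unattacked baseline), not the paper's exact dynamics:

```python
import random

def simulate(lam=1.8, mus=(1.0, 1.0), horizon=10_000, seed=0):
    """Two parallel M/M/1 queues with Poisson arrivals (rate lam) and
    exponential services (rates mus), routed join-the-shortest-queue.

    Illustrative model only; returns the time-averaged total queue length,
    a proxy for the congestion cost the defender wants to keep low.
    """
    rng = random.Random(seed)
    t, q, area = 0.0, [0, 0], 0.0
    while t < horizon:
        # Rates of all possible next events: one arrival, one departure per busy queue.
        rates = [lam] + [mu if n > 0 else 0.0 for mu, n in zip(mus, q)]
        total = sum(rates)
        dt = rng.expovariate(total)       # time to next event
        t += dt
        area += dt * sum(q)               # accumulate queue length over the holding time
        u = rng.random() * total
        if u < lam:                       # arrival: route to the shortest queue
            q[q.index(min(q))] += 1
        elif u < lam + rates[1]:          # departure from queue 0
            q[0] -= 1
        else:                             # departure from queue 1
            q[1] -= 1
    return area / t

print(simulate())
```

An attack in this setting would perturb the routing choice (e.g. forcing arrivals into the longer queue), which raises the time-averaged queue length that the defender's strategy trades off against defense cost.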
The key highlights are:
The defender has no prior knowledge about the attacker's strategy, so the algorithm needs to be threat-adaptive.
The defensive strategy should balance the costs of traffic congestion and defensive efforts, so it needs to be cost-aware.
The authors extend the least-squares policy iteration (LSPI) algorithm to the Markov security game setting, solving a minimax problem in the policy improvement step to compute the Markov perfect equilibrium.
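The minimax step at a given state amounts to solving a zero-sum matrix game over the current Q-value estimates, which reduces to a linear program. A minimal sketch, where the payoff matrix and the reduction are generic illustrations (the matrix entries stand in for approximate Q-values of defender action i vs. attacker action j, not values from the paper):

```python
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(A):
    """Value v and maximizer mixed strategy x of max_x min_y x^T A y.

    Standard LP reduction: maximize v subject to x^T A e_j >= v for every
    pure attacker action j, with x a probability vector.
    """
    A = np.asarray(A, dtype=float)
    m, n = A.shape
    # Decision variables: x (length m) followed by the scalar game value v.
    c = np.zeros(m + 1)
    c[-1] = -1.0                                  # maximize v <=> minimize -v
    A_ub = np.hstack([-A.T, np.ones((n, 1))])     # v - x^T A e_j <= 0 for all j
    b_ub = np.zeros(n)
    A_eq = np.ones((1, m + 1)); A_eq[0, -1] = 0.0 # probabilities sum to 1
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]     # x >= 0, v free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies as a sanity check: equilibrium is uniform mixing, value 0.
x, v = solve_matrix_game([[1, -1], [-1, 1]])
print(x, v)
```

In the policy improvement step, one such game would be solved per sampled state, with the resulting mixed strategies forming the updated defender policy.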
The authors derive a theoretical bound on the prediction error of the proposed algorithm by decomposing it into a projection error and a sampling error, with the sampling error further split into terms for the approximate value function and the true value function.
The sampling error bounds rely on the concentration properties of the reward function and the transition dynamics, established based on the queuing system model.
Stats
(𝜖_L + 𝜖_S) / (𝜎(1 − 𝛾))
𝛾L²c_max / (1 − 𝛾)