Core Concepts
This research paper introduces a novel switching-based reinforcement learning algorithm that guarantees the probabilistic satisfaction of temporal logic constraints throughout the learning process, balancing constraint satisfaction with reward maximization.
Statistics
The robot's action set includes N, NE, E, SE, S, SW, W, NW, and Stay.
The intended transition probability for each action (except "Stay") is 90%.
Unintended transitions occur with a probability of 10%.
The environment is an 8x8 grid.
Light gray cells yield a reward of 1.
Dark gray cells yield a reward of 10.
All other cells yield a reward of 0 (see the environment sketch following these statistics).
The TWTL formula for the pickup and delivery task is [H^1 P]^[0,20] · ([H^1 D1]^[0,20] ∨ [H^1 D2]^[0,20]) · [H^1 Base]^[0,20], i.e., hold at the pickup location P for 1 time step within 20 time steps, then hold at delivery location D1 or D2 for 1 time step within the next 20 time steps, and finally hold at Base for 1 time step within the next 20 time steps.
Each episode lasts for 62 time steps.
The training consists of 40,000 episodes in Case 1.
Cases 2 and 3 use a fixed number of episodes (N_episode = 1,000).
The diminishing ε-greedy policy starts with ε_init = 0.7 and ends with ε_final = 0.0001.
The learning rate is set to 0.1.
The discount factor is set to 0.95.
The z-score is set to 2.58, corresponding to an approximately 99% confidence level.
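As a concrete reference, the following is a minimal Python sketch of the grid world described by the statistics above. It is illustrative only: the actual locations of the reward cells are not listed in this summary (the coordinates below are placeholders), and the way the 10% unintended-transition probability is spread over the other moves is an assumption.

    import random

    # Minimal sketch of the 8x8 grid world described above (assumptions noted in comments).
    GRID_SIZE = 8
    ACTIONS = {
        "N": (-1, 0), "NE": (-1, 1), "E": (0, 1), "SE": (1, 1),
        "S": (1, 0), "SW": (1, -1), "W": (0, -1), "NW": (-1, -1),
        "Stay": (0, 0),
    }
    P_INTENDED = 0.9  # intended transition probability (all actions except "Stay")

    # Placeholder reward cells; the paper's actual cell locations are not listed in this summary.
    LIGHT_GRAY = {(2, 5)}  # reward 1
    DARK_GRAY = {(6, 6)}   # reward 10

    def reward(cell):
        if cell in DARK_GRAY:
            return 10
        if cell in LIGHT_GRAY:
            return 1
        return 0

    def step(state, action):
        """One noisy transition: the intended move with probability 0.9, otherwise
        (assumption) a uniformly random other move; 'Stay' is deterministic."""
        if action != "Stay" and random.random() > P_INTENDED:
            action = random.choice([a for a in ACTIONS if a not in (action, "Stay")])
        dr, dc = ACTIONS[action]
        row = min(max(state[0] + dr, 0), GRID_SIZE - 1)
        col = min(max(state[1] + dc, 0), GRID_SIZE - 1)
        return (row, col), reward((row, col))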
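The training hyperparameters can be collected in the same way. The exact form of the diminishing ε schedule is not given in this summary, so a geometric decay from ε_init to ε_final is assumed, and the Q-learning update shown is the textbook one with the stated learning rate and discount factor rather than necessarily the paper's exact learning rule.

    import numpy as np

    N_EPISODES = 40_000    # Case 1 (Cases 2 and 3 use 1,000 episodes)
    EPISODE_LENGTH = 62    # time steps per episode
    ALPHA = 0.1            # learning rate
    GAMMA = 0.95           # discount factor
    EPS_INIT, EPS_FINAL = 0.7, 1e-4
    Z_SCORE = 2.58         # ~99% confidence level

    def epsilon(episode, n_episodes=N_EPISODES):
        """Diminishing epsilon; a geometric decay over the episodes is assumed here."""
        frac = episode / max(n_episodes - 1, 1)
        return EPS_INIT * (EPS_FINAL / EPS_INIT) ** frac

    def q_update(Q, s, a, r, s_next):
        """Standard one-step Q-learning update with the stated alpha and gamma.
        Q is assumed to be an array of shape (n_states, n_actions); s, a, s_next are integer indices."""
        Q[s, a] += ALPHA * (r + GAMMA * np.max(Q[s_next]) - Q[s, a])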
Quotes
"Conventional formulations of constrained RL (e.g. [1], [2], [3]) focus on maximizing reward functions while keeping some cost function below a certain threshold."
"Driven by the need for a scalable solution that offers desired probabilistic constraint satisfaction guarantees throughout the learning process (even in the first episode of learning), we propose a novel approach that enables the RL agent to alternate between two policies during the learning process."
"The proposed algorithm estimates the satisfaction rate of following the first policy and adaptively updates the switching probability to balance the need for constraint satisfaction and reward maximization."