Core Concepts
An autonomous non-monolithic agent with multi-mode exploration based on an options framework, enabling flexible and adaptive exploration-exploitation behavior.
Abstract
This paper introduces an autonomous non-monolithic agent with multi-mode exploration based on an options framework. The key highlights are:
The agent uses a hierarchical reinforcement learning (HRL) model with three levels: Top, Middle, and Low. The Top-level policy (π^PPO_T) chooses between exploration modes (uniform random, PPO) and exploitation (TD3) as options.
The agent has more entropy choices for exploration, with two exploration-mode policies (π^RND_M, π^PPO_M) and one exploitation policy (π^TD3_M) at the Middle level.
The agent uses a guided exploration strategy, modifying the reward based on a preset parameter (α_g_expl-mode) to encourage either exploration or exploitation (see the sketch after this list).
An online evaluation process is used to ensure a robust optimal policy, where the loss of the Top-level policy (π^PPO_T) is modified based on the success rate (S_E) of the Middle-level policy (π^TD3_M).
The experiments show the proposed agent outperforms a reference non-monolithic exploration method and a monolithic exploration policy (HIRO) on the Ant Push and Ant Fall tasks in the OpenAI Gym environment.
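To make the hierarchy concrete, here is a minimal, self-contained Python sketch of how the pieces above could fit together. It is not the authors' implementation: the policy objects are random placeholders, the concrete α values and the exact form of the success-rate-weighted loss are assumptions, and only the control flow (Top-level option choice, Middle-level dispatch, guided reward scaling, S_E-based loss modification) mirrors the description.

```python
import random

# Placeholder option set: two exploration modes plus TD3 exploitation.
MODES = ["uniform_random", "ppo", "td3"]

# Preset guided-exploration weights alpha_{g_expl-mode}; the paper only
# states the ordering alpha_uniform_random > alpha_ppo >= alpha_td3,
# so these concrete values are illustrative assumptions.
ALPHA = {"uniform_random": 1.0, "ppo": 0.5, "td3": 0.5}

def top_level_policy(state):
    """Stand-in for pi^PPO_T: picks one of the three options.
    A real agent would sample from a trained PPO policy over modes."""
    return random.choices(MODES, weights=[1, 1, 2])[0]

def middle_level_act(mode, state):
    """Stand-in dispatch to pi^RND_M, pi^PPO_M, or pi^TD3_M."""
    return random.uniform(-1.0, 1.0)  # placeholder action

def guided_reward(mode, env_reward):
    """Guided exploration: scale the environment reward by the preset
    alpha of the currently selected g_expl-mode."""
    return ALPHA[mode] * env_reward

def top_level_loss(ppo_loss, success_rate):
    """Online evaluation: modify the Top-level PPO loss using the success
    rate S_E of pi^TD3_M. The exact modification is not spelled out in
    this summary, so a simple down-weighting is assumed here."""
    return (1.0 - success_rate) * ppo_loss

# One toy rollout: the Top level picks an option each step, the chosen
# Middle-level policy acts, and the guided reward is accumulated.
# (In an options framework the chosen option would normally run for a
# multi-step horizon before control returns to the Top level; a single
# step per choice is used here for brevity.)
state, total = 0.0, 0.0
for _ in range(100):
    mode = top_level_policy(state)
    action = middle_level_act(mode, state)
    env_reward = -abs(state - action)      # toy environment reward
    total += guided_reward(mode, env_reward)
    state = 0.9 * state + 0.1 * action

print(f"accumulated guided reward: {total:.3f}")
print(f"modified top-level loss:   {top_level_loss(0.8, success_rate=0.6):.3f}")
```

Note the design choice the sketch reflects, stated in the first quote below: PPO (and the uniform random mode) serve only as disposable exploration scaffolding, while TD3 is the one policy kept as the final target.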
Stats
The agent's exploration and exploitation counts are analyzed to understand the switching behavior.
The total steps of the three Middle-level policies are compared: Total_Step(π^TD3_M) >> Total_Step(π^PPO_M) > Total_Step(π^RND_M).
Quotes
"Our model just consumes PPO for an exploration mode so that it will be discarded at the end of training. Our model takes care of only off-policy, TD3, as a final target policy."
"The value of αg_expl-mode is differently or sometimes equally preset according to the type of g_expl-mode as αuniform random > αppo > or equal to αtd3."