Key Concepts
An efficient method for searching the space of linear utility functions to approximate the Pareto front in multi-objective reinforcement learning problems.
Summary
The source describes UCB-MOPPO, a method that solves multi-objective reinforcement learning (MORL) problems efficiently by decomposing them into a set of scalar reinforcement learning sub-problems. Its key aspects are:
Decomposition of the MORL problem into scalar RL sub-problems:
- The overall scalarisation weight simplex is decomposed into K sub-spaces.
- A separate policy is trained for each sub-problem by conditioning it on scalarisation vectors sampled from the associated sub-space.
- This two-layer decomposition (sub-spaces first, then scalarisation vectors sampled within each sub-space) allows different policies to specialise in different regions of the scalarisation vector space.
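The decomposition above can be sketched in a few lines. This is a minimal illustration only, assuming two objectives and equal-width sub-spaces, neither of which the summary specifies; the helper names `decompose_simplex_2d`, `sample_weight`, and `scalarise` are hypothetical and not taken from the source.

```python
import random

def decompose_simplex_2d(k):
    """Split the 2-objective weight simplex {(w, 1 - w) : 0 <= w <= 1}
    into k sub-spaces, each an interval of the first weight.
    Equal-width intervals are an assumption made for this sketch."""
    edges = [i / k for i in range(k + 1)]
    return list(zip(edges[:-1], edges[1:]))

def sample_weight(subspace, rng):
    """Draw one scalarisation vector uniformly from a sub-space."""
    lo, hi = subspace
    w1 = rng.uniform(lo, hi)
    return (w1, 1.0 - w1)

def scalarise(returns, weight):
    """Linear utility: weighted sum of the per-objective returns."""
    return sum(w * r for w, r in zip(weight, returns))

rng = random.Random(0)
subspaces = decompose_simplex_2d(4)  # K = 4 sub-problems, one policy each
# Each policy is trained on several vectors drawn from its own sub-space.
weights_per_policy = [[sample_weight(s, rng) for _ in range(3)] for s in subspaces]
```

Each entry of `weights_per_policy` holds the scalarisation vectors that one policy would be conditioned on, and `scalarise` turns a vector of returns into the scalar reward signal for the corresponding sub-problem.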
Scalarisation-vector-conditioned Actor-Critic:
- Both the policy network and the value network are conditioned on the scalarisation vector.
- This allows a single policy to express different trade-offs between objectives by generalising to a neighbourhood of scalarisation vectors.
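To make the conditioning concrete, the sketch below feeds the concatenation of state and scalarisation vector into both networks. It is a dependency-free toy, not the architecture from the source: the two-layer tanh MLP, the hidden size, the scalar critic output, and the class name `ConditionedActorCritic` are all assumptions made for illustration.

```python
import math
import random

def linear(x, W, b):
    """Dense layer: one output per row of W."""
    return [sum(wi * xi for wi, xi in zip(row, x)) + bi for row, bi in zip(W, b)]

def init_layer(n_in, n_out, rng):
    """Small uniform random weights, zero biases."""
    scale = 1.0 / math.sqrt(n_in)
    W = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return W, [0.0] * n_out

class ConditionedActorCritic:
    """Actor and critic that both take [state, scalarisation vector] as input."""

    def __init__(self, state_dim, n_objectives, action_dim, hidden=16, seed=0):
        rng = random.Random(seed)
        in_dim = state_dim + n_objectives  # the weight vector is part of the input
        self.actor = [init_layer(in_dim, hidden, rng), init_layer(hidden, action_dim, rng)]
        # Assumption: the critic returns one scalar (the weighted value estimate).
        self.critic = [init_layer(in_dim, hidden, rng), init_layer(hidden, 1, rng)]

    @staticmethod
    def _mlp(layers, x):
        (W1, b1), (W2, b2) = layers
        h = [math.tanh(v) for v in linear(x, W1, b1)]
        return linear(h, W2, b2)

    def action_mean(self, state, weight):
        """Mean of a continuous-action policy for this state and trade-off."""
        return self._mlp(self.actor, list(state) + list(weight))

    def value(self, state, weight):
        """Value estimate for this state under this trade-off."""
        return self._mlp(self.critic, list(state) + list(weight))

model = ConditionedActorCritic(state_dim=3, n_objectives=2, action_dim=2)
state = [0.1, 0.2, 0.3]
a_first = model.action_mean(state, (0.9, 0.1))   # favours objective 1
a_second = model.action_mean(state, (0.1, 0.9))  # favours objective 2
```

The same parameters produce different action means for the two weight vectors, which is how a single policy can express several trade-offs within its neighbourhood of the weight space.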
Surrogate-assisted maximisation of CCS hypervolume:
- An acquisition function based on Upper Confidence Bound (UCB) is used to select the scalarisation vectors to train on from each sub-space.
- At each stage of training, the selected scalarisation vectors are those expected to yield the largest increase in the hypervolume of the resulting Convex Coverage Set (CCS).
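The selection step can be illustrated with a toy loop over a single sub-space. The summary does not specify the surrogate model, so this sketch swaps in a simple stand-in: a running mean of observed hypervolume gains per candidate plus a UCB1-style count-based bonus. The function `toy_evaluate` is a hypothetical placeholder for training and evaluating a policy, and the two-objective setting, the reference point, and `beta` are assumptions.

```python
import math

def hypervolume_2d(points, ref):
    """Area dominated by `points` and bounded by `ref` (both objectives maximised)."""
    pts = sorted((p for p in points if p[0] > ref[0] and p[1] > ref[1]),
                 key=lambda p: p[0], reverse=True)
    area, best_y = 0.0, ref[1]
    for x, y in pts:
        if y > best_y:  # only non-dominated points add a new strip
            area += (x - ref[0]) * (y - best_y)
            best_y = y
    return area

def ucb_select(mean_gain, counts, total, beta=1.0):
    """Index with the highest upper confidence bound on hypervolume gain."""
    def score(i):
        if counts[i] == 0:
            return math.inf  # try every candidate at least once
        return mean_gain[i] + beta * math.sqrt(math.log(total) / counts[i])
    return max(range(len(counts)), key=score)

def toy_evaluate(weight):
    """Hypothetical stand-in for train-and-evaluate: the point on a
    unit quarter-circle front that is optimal for this weight."""
    w1, w2 = weight
    n = math.hypot(w1, w2)
    return (w1 / n, w2 / n)

candidates = [(i / 4, 1 - i / 4) for i in range(5)]  # vectors in one sub-space
mean_gain = [0.0] * len(candidates)
counts = [0] * len(candidates)
front, ref = [], (0.0, 0.0)

for t in range(1, 11):
    i = ucb_select(mean_gain, counts, t)
    before = hypervolume_2d(front, ref)
    front.append(toy_evaluate(candidates[i]))
    gain = hypervolume_2d(front, ref) - before
    counts[i] += 1
    mean_gain[i] += (gain - mean_gain[i]) / counts[i]  # incremental mean
```

The exploration bonus shrinks as a candidate is selected more often, so the loop gradually concentrates on the scalarisation vectors whose observed hypervolume gains are largest.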
The proposed UCB-MOPPO method is reported to outperform several MORL baselines on MuJoCo benchmark problems across different random seeds. It achieves significantly higher hypervolume than the PGMORL baseline while maintaining fewer policies, which makes it suitable for resource-constrained settings.
Statistics
The source provides no specific numerical data or metrics. It describes the proposed UCB-MOPPO method and compares its performance with baseline methods on MuJoCo benchmark problems.
Quotes
The source contains no direct quotes relevant to its key arguments.