Core Concepts
While learning in Markov games against adaptive adversaries is statistically hard in general, sublinear policy regret becomes achievable against memory-bounded, stationary adversaries by leveraging their consistency in responding to the learner's policies.
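Policy regret is the performance criterion behind all of the statements below. The following is a minimal LaTeX sketch of one common formalization of that notion; the notation (ν_t for the adversary's response map, V^{π,ν} for the learner's expected return) is assumed for illustration and is not taken verbatim from the paper.

```latex
% Hedged sketch of policy regret against an adaptive adversary (notation assumed).
% \nu_t maps the learner's past policies to the adversary's episode-t policy;
% V^{\pi,\nu} is the learner's expected return when playing \pi against \nu.
\[
\mathrm{PolReg}(T)
  = \max_{\pi \in \Pi} \sum_{t=1}^{T} V^{\pi,\, \nu_t(\pi, \dots, \pi)}
  - \sum_{t=1}^{T} V^{\pi_t,\, \nu_t(\pi_1, \dots, \pi_t)}.
\]
% The comparator term lets the adversary respond to the counterfactual history in
% which the learner had committed to \pi from the start; for an m-memory bounded,
% stationary adversary, \nu_t depends only on the learner's last m policies.
```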
Stats
Any learner must incur a linear policy regret of Ω(T) against an adaptive adversary with unbounded memory.
Learning in Markov games requires an exponential number of samples, Ω((SA)^H / ε^2), to obtain an ε-suboptimal policy in terms of policy regret, even against an oblivious adversary (memory length m = 0).
Against a 1-memory bounded and stationary adversary, the policy regret necessarily scales with the cardinality of the learner's policy set, Ω(min{T, |Π|}).
OPO-OMLE achieves a policy regret bound of Õ(H^3 S^2 AB + √(H^5 S A^2 B T)) against 1-memory bounded, stationary, and consistent adversaries.
APE-OVE achieves a policy regret bound of Õ((m − 1)H^2 SAB + √(H^3 SAB(SAB(H + √S) + H^2)) · √(T / d_*)) against m-memory bounded, stationary, and consistent adversaries; see the scaling sketch below.
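To make the gap between these upper bounds and the Ω(T) barrier concrete, here is a minimal Python sketch that plugs hypothetical parameter values into the bounds as stated above, with all constants and logarithmic factors dropped and d_* treated as a fixed positive constant. Every numerical value is an assumption for illustration, not a value from the paper.

```python
# Illustration only: evaluates the shape of the regret bounds quoted above.
# Constants and log factors are dropped; all parameter values are hypothetical.
import math

H, S, A, B = 5, 10, 4, 4   # horizon, states, learner actions, adversary actions (assumed)
m, d_star = 2, 0.1         # adversary memory length and density parameter d_* (assumed)

def lower_bound_unbounded_memory(T):
    """Omega(T): linear policy regret against an unbounded-memory adversary."""
    return T

def opo_omle_bound(T):
    """~O(H^3 S^2 A B + sqrt(H^5 S A^2 B T)) vs. 1-memory, stationary, consistent adversaries."""
    return H**3 * S**2 * A * B + math.sqrt(H**5 * S * A**2 * B * T)

def ape_ove_bound(T):
    """~O((m-1) H^2 S A B + sqrt(H^3 S A B (S A B (H + sqrt(S)) + H^2)) * sqrt(T/d_*))."""
    burn_in = (m - 1) * H**2 * S * A * B
    leading = math.sqrt(H**3 * S * A * B * (S * A * B * (H + math.sqrt(S)) + H**2))
    return burn_in + leading * math.sqrt(T / d_star)

for T in (10**4, 10**6, 10**8):
    print(f"T={T:>9}  "
          f"OPO-OMLE/T={opo_omle_bound(T) / T:8.3f}  "
          f"APE-OVE/T={ape_ove_bound(T) / T:8.3f}  "
          f"Omega(T)/T={lower_bound_unbounded_memory(T) / T:.1f}")
```

As T grows, the regret-to-T ratios of both upper bounds shrink toward zero (reflecting their √T scaling), while the unbounded-memory lower bound keeps a ratio of 1, i.e., linear regret.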
Quotes
"Policy regret is not adequate to study adaptive opponents as it does not take into account the counterfactual response of the opponents."
"We argue that the [consistent behavior] definition above is natural if we are to consider opponents that are self-interested strategic agents, and not simply a malicious adversary."