Core Concepts
An improved algorithm for linear mixture MDPs in adversarial settings.
Statistics
"Our result strictly improves the previous best-known $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ result in Zhao et al. (2023a) since $H \le S$ holds by the layered MDP structure."
"Our algorithm attains $\widetilde{O}(d\sqrt{HS^3K} + \sqrt{HSAK})$ regret, strictly improving the $\widetilde{O}(dS^2\sqrt{K} + \sqrt{HSAK})$ regret of Zhao et al. (2023a) since $H \le S$ by the layered MDP structure."
"Our innovative use of techniques from dynamic assortment problems to mitigate estimation errors in RL theory is novel and may provide helpful insights for future research."
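The comparison between the two regret bounds quoted above rests on a short calculation. The step below uses only the condition $H \le S$ stated in the quotes; it is a reader's check, not a derivation taken from the paper.

```latex
\[
d\sqrt{HS^3K}
  \;=\; d\,S^{3/2}\sqrt{H}\,\sqrt{K}
  \;\le\; d\,S^{3/2}\sqrt{S}\,\sqrt{K}
  \;=\; d\,S^{2}\sqrt{K}.
\]
```

The second term, $\sqrt{HSAK}$, is the same in both bounds, so any improvement comes from the first term, and the inequality above is strict whenever $H < S$.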
Quotes
"Our advancements are primarily attributed to (i) a new least square estimator for the transition parameter that leverages the visit information of all states, as opposed to only one state in prior work, and (ii) a new self-normalized concentration tailored specifically to handle non-independent noises."
"Our algorithm is similar to that of Zhao et al. (2023a): we first estimate the unknown transition parameter and construct corresponding confident sets."
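To make the first quote more concrete, here is a minimal sketch of a least-squares estimator that uses information about every candidate next state rather than a single one. It assumes the standard linear mixture model, in which $P(s' \mid s, a) = \langle \phi(s' \mid s, a), \theta^* \rangle$, and regresses the one-hot indicator of the observed next state on the features of all next states. The function names, the ridge parameter `reg`, and the toy MDP construction are illustrative assumptions; this is not claimed to be the paper's exact estimator, and it omits the confidence-set construction entirely.

```python
import numpy as np


def make_linear_mixture(num_states, num_actions, dim, rng):
    """Build a toy linear mixture kernel (hypothetical helper).

    phi[s, a, s_next] is a d-vector. Each coordinate is itself a probability
    distribution over s_next and theta lies on the simplex, so
    P(s_next | s, a) = <phi[s, a, s_next], theta> is a valid kernel.
    """
    base = rng.dirichlet(0.5 * np.ones(num_states),
                         size=(num_states, num_actions, dim))  # (S, A, d, S)
    phi = np.transpose(base, (0, 1, 3, 2))                      # (S, A, S, d)
    theta = rng.dirichlet(np.ones(dim))
    return phi, theta


def sample_transitions(phi, theta, num_samples, rng):
    """Draw (s, a, s_next) triples with uniformly random s and a."""
    num_states, num_actions = phi.shape[:2]
    transitions = []
    for _ in range(num_samples):
        s = rng.integers(num_states)
        a = rng.integers(num_actions)
        probs = phi[s, a] @ theta            # next-state distribution, shape (S,)
        s_next = rng.choice(num_states, p=probs)
        transitions.append((s, a, s_next))
    return transitions


def estimate_theta(phi, transitions, reg=1.0):
    """Ridge least squares over all candidate next states.

    Each observed (s, a, s_next) contributes one regression row per candidate
    state s': feature phi[s, a, s'] with target 1{s' == s_next}.
    """
    dim = phi.shape[-1]
    gram = reg * np.eye(dim)
    moment = np.zeros(dim)
    for s, a, s_next in transitions:
        feats = phi[s, a]                    # (S, d): features of every next state
        gram += feats.T @ feats              # sum over s' of phi phi^T
        moment += feats[s_next]              # sum over s' of phi * indicator
    return np.linalg.solve(gram, moment)
```

A typical use is to call `make_linear_mixture`, draw data with `sample_transitions`, and compare the output of `estimate_theta` with the true parameter. Because each transition contributes a row for every candidate next state, the Gram matrix accumulates information faster than a regression that tracks only one state per sample.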