Key Concepts
This paper proposes and analyzes an ε-policy gradient algorithm that integrates model-based and model-free reinforcement learning approaches to efficiently learn an optimal pricing policy for online pricing problems.
Summary
The paper presents an ε-policy gradient (ε-PG) algorithm for online pricing problems that combines model-based and model-free reinforcement learning. Its key components are:
Model-based approach: The algorithm assumes the customer response distribution follows a parametric form and estimates the unknown parameter by solving an empirical risk minimization problem after each trial.
Model-free approach: The algorithm updates the pricing policy with a policy gradient method, replacing the greedy exploitation step of the ε-greedy algorithm with a gradient descent step. This facilitates learning via model inference and improves sample efficiency.
Exploration-exploitation tradeoff: The algorithm explores the environment with probability ε and exploits the current policy with probability 1-ε, reducing the exploration probability ε at a suitable rate as learning proceeds; a minimal sketch of the resulting loop is given after this list.
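The sketch below illustrates how such an ε-PG pricing loop could be organized. The logistic demand model, the function names (purchase_prob, erm_estimate, reward_gradient), the step sizes, and the 1/√t exploration schedule are all illustrative assumptions for this sketch, not the paper's exact specification.

```python
import numpy as np

rng = np.random.default_rng(0)

# ---- Illustrative assumptions (not the paper's exact specification) ----
# Customer response: purchase with probability sigma(a - b * price),
# parameterized by theta = (a, b); reward = price * purchase indicator.
true_theta = np.array([2.0, 1.0])

def purchase_prob(price, theta):
    a, b = theta
    return 1.0 / (1.0 + np.exp(-(a - b * price)))

def sample_response(price):
    return float(rng.random() < purchase_prob(price, true_theta))

def erm_estimate(prices, responses, theta0, steps=100, lr=0.1):
    """Model-based step: fit theta by minimizing the empirical
    negative log-likelihood of the observed customer responses."""
    theta = theta0.copy()
    x, y = np.asarray(prices), np.asarray(responses)
    for _ in range(steps):
        resid = purchase_prob(x, theta) - y             # d(loss)/d(logit)
        grad = np.array([resid.sum(), -(resid * x).sum()]) / len(x)
        theta = theta - lr * grad
    return theta

def reward_gradient(price, theta, h=1e-4):
    """Gradient of the model-implied expected reward
    r(p) = p * purchase_prob(p, theta), via central finite differences."""
    r = lambda p: p * purchase_prob(p, theta)
    return (r(price + h) - r(price - h)) / (2.0 * h)

# ---- epsilon-PG loop: explore with prob. eps_t, otherwise take a gradient step ----
T = 500
price, theta_hat = 1.0, np.array([0.0, 0.5])
prices, responses = [], []
for t in range(1, T + 1):
    eps_t = min(1.0, 1.0 / np.sqrt(t))          # assumed decay rate of exploration
    posted = rng.uniform(0.1, 3.0) if rng.random() < eps_t else price
    prices.append(posted)
    responses.append(sample_response(posted))

    # Model-based: re-estimate theta by empirical risk minimization.
    theta_hat = erm_estimate(prices, responses, theta_hat)

    # Model-free flavour: a gradient step on the estimated expected reward,
    # in place of the greedy maximization used by plain epsilon-greedy.
    price = float(np.clip(price + 0.05 * reward_gradient(price, theta_hat), 0.1, 3.0))

print(f"learned price {price:.3f}, theta_hat {theta_hat}")
```

The gradient step only needs the estimated model locally around the current price, which is what lets the method avoid computing a full greedy policy at every trial.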
The paper analyzes the regret of the proposed ε-PG algorithm by quantifying the exploration cost in terms of the exploration probability ε and the exploitation cost in terms of the gradient descent optimization and gradient estimation errors. Under suitable assumptions, the algorithm achieves an expected regret of order O(√T) (up to a logarithmic factor) over T trials.
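Schematically, and with constants, step sizes, and precise assumptions deferred to the paper, the regret decomposition described above can be written as follows; the per-trial optimization and gradient estimation error symbols are illustrative notation, not the paper's:

```latex
\[
\mathbb{E}\big[\mathrm{Regret}(T)\big]
\;\le\;
\underbrace{C \sum_{t=1}^{T} \varepsilon_t}_{\text{exploration cost}}
\;+\;
\underbrace{\sum_{t=1}^{T} \big(e_t^{\mathrm{opt}} + e_t^{\mathrm{grad}}\big)}_{\text{exploitation cost}}
\;=\; \widetilde{O}\big(\sqrt{T}\big),
\]
```

where the tilde hides the logarithmic factor mentioned above.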
The proposed approach tackles several drawbacks of existing contextual bandit and online optimization algorithms, such as the need for greedy policy computation, structural assumptions on the reward function, and cold-start issues when the objective or environment changes.
Statistics
The paper does not provide any specific numerical data or statistics. It focuses on the theoretical analysis of the proposed ε-policy gradient algorithm.
Quotes
There are no direct quotes from the paper that are particularly striking or that support its key arguments.