Core Concepts
Random forests can effectively model and estimate discrete choices, providing an interpretable and flexible approach that can outperform existing methods, especially when training data is plentiful.
Abstract
The paper introduces a data-driven framework that combines machine learning with interpretable behavioral modeling to estimate discrete choices. The key idea is to fit random forests, a popular machine-learning algorithm, to historical sales data and use the fitted forest to predict future customer behavior.
The main highlights and insights are:
Theoretical analysis:
Random forests can consistently recover any discrete choice model (DCM) underlying the data as the sample size increases.
Random forests can be viewed as adaptive nearest neighbors, whose performance is explained by the distance to the nearest neighbors in the training data, the continuity of the underlying DCM, and the sampling error.
The splitting criterion used by random forests, such as the Gini index and information gain ratio, is intrinsically connected to the preference ranking of customers.
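To make the estimation idea concrete, here is a minimal sketch (not the paper's code) of using a random forest to recover choice probabilities: each offered assortment is encoded as a binary availability vector, the label is the chosen product, and `predict_proba` yields estimated choice probabilities. The multinomial-logit-style data generator is purely illustrative.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)

# Hypothetical setup: N = 5 products; each assortment is a binary
# availability vector in {0,1}^N, and the label is the index of the
# chosen product (0 = no purchase).
N, n_samples = 5, 2000
X = rng.integers(0, 2, size=(n_samples, N))

# Simulate choices from a logit-style ground truth (illustrative only):
# column 0 is the outside (no-purchase) option.
u = np.column_stack([rng.gumbel(size=n_samples),
                     rng.gumbel(size=(n_samples, N)) + np.linspace(0, 2, N)])
u[:, 1:][X == 0] = -np.inf   # absent products cannot be chosen
y = u.argmax(axis=1)

forest = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)

# Estimated choice probabilities for an assortment offering the
# first and fourth products.
assortment = np.array([[1, 0, 0, 1, 0]])
probs = forest.predict_proba(assortment)[0]
```

As the number of times an assortment appears in the training data grows, these estimated frequencies converge to the underlying DCM's choice probabilities, which is the consistency result summarized above.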
Practical advantages:
Random forests can capture complex behavioral patterns that elude standard choice models, such as irrational choices and sequential search.
Random forests can handle nonstandard historical data formats, a major challenge in practice.
Random forests can measure product importance as how often a random customer's decision would change depending on whether the product is offered.
Random forests can incorporate price information and customer features, making them compatible with personalized online retailing.
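The product-importance idea above can be sketched directly: toggle one product's availability across random assortments and count how often the forest's predicted choice changes. This is an illustrative proxy for the paper's measure; the data generator and function name are assumptions, not the paper's code.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(1)

# Train a forest on simulated assortment/choice data (illustrative).
N, n_samples = 5, 2000
X = rng.integers(0, 2, size=(n_samples, N))
u = np.column_stack([rng.gumbel(size=n_samples),
                     rng.gumbel(size=(n_samples, N)) + np.linspace(0, 2, N)])
u[:, 1:][X == 0] = -np.inf
y = u.argmax(axis=1)
forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

def product_importance(forest, n_products, n_draws=1000, seed=0):
    """Fraction of random assortments where toggling product j's
    availability changes the predicted choice (hypothetical proxy)."""
    rng = np.random.default_rng(seed)
    A = rng.integers(0, 2, size=(n_draws, n_products))
    imp = np.zeros(n_products)
    for j in range(n_products):
        A_on, A_off = A.copy(), A.copy()
        A_on[:, j], A_off[:, j] = 1, 0
        imp[j] = np.mean(forest.predict(A_on) != forest.predict(A_off))
    return imp

importance = product_importance(forest, N)
```

A product whose presence frequently flips the predicted choice scores high; one the forest effectively ignores scores near zero.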
Numerical experiments on synthetic and real data show that random forests can outperform existing methods at estimating customer choices, especially when the training data set is large.
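Because the forest operates on an arbitrary feature vector, prices and customer covariates can simply be appended to the availability encoding, which is what makes the approach compatible with personalized retailing. A minimal sketch under assumed data (the generator, feature layout, and a single `customer_age` covariate are all hypothetical):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(2)
N, n_samples = 3, 1500

availability = rng.integers(0, 2, size=(n_samples, N))
prices = rng.uniform(1.0, 5.0, size=(n_samples, N)) * availability
customer_age = rng.uniform(18, 70, size=(n_samples, 1))

# Feature vector = [availability | prices | customer features]
X = np.hstack([availability, prices, customer_age])

# Illustrative price-sensitive choice generator (column 0 = no purchase).
u = np.column_stack([rng.gumbel(size=n_samples),
                     rng.gumbel(size=(n_samples, N)) + 2.0 - 0.5 * prices])
u[:, 1:][availability == 0] = -np.inf
y = u.argmax(axis=1)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X, y)

# Personalized prediction: products 1 and 2 offered at given prices,
# for a 35-year-old customer.
query = np.array([[1, 1, 0, 2.5, 3.0, 0.0, 35.0]])
probs = forest.predict_proba(query)[0]
```

Changing the prices or the customer features in `query` yields different predicted choice probabilities for the same assortment, which is the personalization use case noted above.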
Stats
"The difference in the choice probabilities of two neighboring assortments with distance one is c/N."
"When an assortment has O(|S|^2 · N^2/(log N)^2) samples, the error in using the frequencies to approximate the choice probabilities is at most O(log N/(N · |S|))."
Quotes
"Random forests can accurately predict the choice probability of any DCM, given that the firm has offered the assortment many times."
"The splitting criterion used by random forests, such as the Gini index and information gain ratio, is intrinsically connected to the preference ranking of customers."