Core Concepts
One-hot encoding of categorical variables can lead to a product-of-Bernoullis (PoB) assumption in Naive Bayes classifiers, rather than the correct categorical Naive Bayes model. This can result in differences in the posterior probabilities and, in some cases, the maximum a posteriori (MAP) class assignment.
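The difference between the two models can be shown on a tiny example. The sketch below uses hypothetical parameters (a single K=3-state feature, C=2 classes; the specific numbers are illustrative, not from the paper) and computes the posterior once under the correct categorical model and once under the PoB model that treats the one-hot bits as independent Bernoullis:

```python
import numpy as np

# Hypothetical parameters: theta[c, k] = p(x = k | class c); each row sums to 1.
theta = np.array([[0.7, 0.2, 0.1],
                  [0.2, 0.5, 0.3]])
prior = np.array([0.5, 0.5])

x = 0  # observed state; one-hot encoded as [1, 0, 0]

# Correct categorical Naive Bayes: the likelihood is just theta[:, x].
cat_post = prior * theta[:, x]
cat_post /= cat_post.sum()

# Product-of-Bernoullis: each one-hot bit is treated as an independent
# Bernoulli, so the zero bits contribute extra factors (1 - theta[:, k]).
onehot = np.eye(3)[x]
pob_lik = np.prod(theta ** onehot * (1 - theta) ** (1 - onehot), axis=1)
pob_post = prior * pob_lik
pob_post /= pob_post.sum()

print(cat_post)  # posterior under the categorical model
print(pob_post)  # posterior under the PoB model -- generally different
```

With these numbers the PoB posterior is sharper than the categorical one, matching the "overcounting" behaviour described below; the MAP class happens to agree here, but need not in general.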
Abstract
The paper investigates the consequences of encoding a K-valued categorical variable with one-hot encoding when it is fed to a Naive Bayes classifier: the encoding gives rise to a product-of-Bernoullis (PoB) assumption rather than the correct categorical Naive Bayes model.
The analysis shows that the Q^-j factors introduced by the PoB model can make the ratio f_j(θ_c)/f_j(θ_d) more extreme than the ratio of the original parameters θ_jc/θ_jd. This is because the Q^-j factors are bounded between the minimum value of θ_j and a maximum value greater than θ_j.
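The Q^-j factor can be computed directly. The sketch below (an illustration under assumed Dirichlet-drawn parameters, not the paper's exact analysis) defines Q^-j(θ) = ∏_{k≠j} (1 - θ_k), forms f_j(θ) = θ_j · Q^-j(θ), and compares the PoB ratio f_j(θ_c)/f_j(θ_d) against the categorical ratio θ_jc/θ_jd for two hypothetical classes:

```python
import numpy as np

rng = np.random.default_rng(0)
K = 4
# Hypothetical per-class categorical parameter vectors (assumed, not from the paper).
theta_c = rng.dirichlet(np.ones(K))
theta_d = rng.dirichlet(np.ones(K))

def Q_minus_j(theta, j):
    # Q^-j(theta) = prod over k != j of (1 - theta_k):
    # the extra factor the PoB model attaches to the observed state j.
    return np.prod(np.delete(1.0 - theta, j))

j = 0
f_c = theta_c[j] * Q_minus_j(theta_c, j)
f_d = theta_d[j] * Q_minus_j(theta_d, j)

print(theta_c[j] / theta_d[j])  # ratio under the categorical model
print(f_c / f_d)                # ratio under the PoB model; can be more extreme
```

Whether the PoB ratio is actually more extreme depends on the particular θ vectors; the point is that the Q^-j factors generally differ across classes and so perturb the ratio.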
Experiments using Naive Bayes classifiers with C=4 classes and K=3, 6, and 10 states, with the θ vectors drawn from Dirichlet distributions with α=1 and α=1/K, show that the maximum posterior probability under the PoB model is usually higher than under the exact categorical model: the PoB maximum is higher in 82.0%, 72.3% and 74.7% of cases for K=3, 6, 10 with α=1, and in 78.0%, 78.1% and 76.4% of cases respectively with α=1/K.
The experiments also show that the fraction of cases where the Naive Bayes classifiers disagree on the MAP class assignment decreases as K increases, and is somewhat higher for the sparser α=1/K distribution than for α=1.
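The experimental comparison can be replicated in outline. The sketch below is an assumed reconstruction, not the paper's exact protocol: it fixes C=4 and K=3 with α=1, draws one θ vector per class from a Dirichlet, picks the observed state uniformly (the paper may sample it differently), and tallies how often the PoB maximum posterior exceeds the categorical one and how often the MAP classes disagree:

```python
import numpy as np

rng = np.random.default_rng(42)
C, K, alpha, trials = 4, 3, 1.0, 2000

higher = 0    # PoB max posterior exceeds the categorical max posterior
disagree = 0  # the two models pick different MAP classes

for _ in range(trials):
    theta = rng.dirichlet(np.full(K, alpha), size=C)  # one theta vector per class
    j = rng.integers(K)  # observed state (assumption: drawn uniformly)

    # Categorical Naive Bayes posterior (uniform class prior).
    cat = theta[:, j]
    cat = cat / cat.sum()

    # PoB posterior: one-hot bits treated as independent Bernoullis.
    onehot = np.eye(K)[j]
    pob = np.prod(theta ** onehot * (1 - theta) ** (1 - onehot), axis=1)
    pob = pob / pob.sum()

    higher += pob.max() > cat.max()
    disagree += pob.argmax() != cat.argmax()

print(f"PoB max posterior higher: {100 * higher / trials:.1f}%")
print(f"MAP disagreement:         {100 * disagree / trials:.1f}%")
```

Under this setup the "PoB higher" fraction comes out well above 50%, qualitatively consistent with the percentages reported above, though the exact numbers depend on how the observed state is sampled.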
The analysis and experiments highlight the importance of understanding the encoding of variables, as using a "bare" table without metadata may lead to a mis-application of the product-of-Bernoullis model.
Stats
Beyond the experimental percentages summarized above, the paper does not report additional numerical metrics; its contribution is mathematical analysis supported by simulation results.
Quotes
"If these bits are naïvely treated as K independent Bernoulli variables, then the classification probabilities will not be correctly computed under the Naïve Bayes model."
"The differences between the posterior probabilities computed under the categorical model of eq. 2 and the PoB of eq. 3 is the Q^-j factors that appear in the numerator and denominator of the latter."
"The observations above are consistent with the idea that the PoB assumption "overcounts" the evidence from the x variable, relative to the correct categorical encoding."