Core Concepts
Different encoding techniques, including Ordinal, One-Hot, Rarelabel, String Similarity, Summary, and Target encoding, have varying impacts on the performance of entity and context embeddings in tabular learning tasks. The results show that String Similarity encoding generally outperforms the commonly used Ordinal encoding, especially for multi-label classification problems.
Abstract
The work examines the effect of different encoding techniques on entity and context embeddings in tabular learning. It begins with discretization methods for handling continuous variables, covering both unsupervised (k-means) and supervised (decision-tree) approaches.
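A minimal sketch of the supervised variant, assuming scikit-learn: a shallow decision tree is fit against the target and its split thresholds become the bin edges. The function name and the max_leaf_nodes setting are illustrative, not taken from the paper.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

def tree_discretize(x, y, max_leaf_nodes=5):
    """Bin one continuous feature using the split thresholds of a
    shallow decision tree fit against the target (illustrative)."""
    tree = DecisionTreeClassifier(max_leaf_nodes=max_leaf_nodes)
    tree.fit(x.reshape(-1, 1), y)
    # Internal nodes carry the learned thresholds; leaves are marked -2.
    edges = np.sort(tree.tree_.threshold[tree.tree_.feature != -2])
    # Each value is assigned to the interval between consecutive edges.
    return np.digitize(x, edges), edges

x = np.random.RandomState(0).normal(size=200)
y = (x > 0.3).astype(int)
bins, edges = tree_discretize(x, y)
```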
The paper then explores various encoding techniques, including Ordinal, One-Hot, Rarelabel, String Similarity, Summary, and Target encoding. These methods are used to transform categorical features into a numerical representation suitable for machine learning algorithms.
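To make the contrast concrete, here is an illustrative look at three of the listed encoders, assuming scikit-learn (version 1.2 or later for the sparse_output argument) and pandas; these are generic implementations, not the paper's exact code.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder

df = pd.DataFrame({"color": ["red", "blue", "red", "green"],
                   "label": [1, 0, 1, 0]})

# Ordinal: one integer per category (alphabetical here: blue=0, green=1, red=2).
ordinal = OrdinalEncoder().fit_transform(df[["color"]])
# One-Hot: one binary column per category.
onehot = OneHotEncoder(sparse_output=False).fit_transform(df[["color"]])
# Target: each category is replaced by the mean label it co-occurs with.
target = df["color"].map(df.groupby("color")["label"].mean())
```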
The experimental setup involves preprocessing 10 datasets from the UCI Machine Learning Repository, covering both binary and multi-label classification tasks. Continuous features are first discretized using a decision-tree-based approach, and the different encoding techniques are then applied.
Two neural network models are implemented: the Entity model, which uses parallel embedding layers, and the Context model, which incorporates a transformer-based encoder. The models are trained and evaluated using the preprocessed data, and the F1-score is used as the primary performance metric.
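A minimal sketch of the Entity model's parallel-embedding idea, assuming PyTorch; the dimensions and the classification head are illustrative, not the paper's exact architecture. The Context model would instead pass the stacked per-feature embeddings through a transformer encoder (e.g. nn.TransformerEncoder) before the head.

```python
import torch
import torch.nn as nn

class EntityModel(nn.Module):
    """Illustrative: one embedding table per categorical feature,
    applied in parallel and concatenated for classification."""
    def __init__(self, cardinalities, emb_dim=8, n_classes=2):
        super().__init__()
        self.embeddings = nn.ModuleList(
            nn.Embedding(card, emb_dim) for card in cardinalities)
        self.head = nn.Linear(emb_dim * len(cardinalities), n_classes)

    def forward(self, x):  # x: (batch, n_features) of category indices
        embedded = [emb(x[:, i]) for i, emb in enumerate(self.embeddings)]
        return self.head(torch.cat(embedded, dim=1))

model = EntityModel(cardinalities=[4, 6, 3])
logits = model(torch.randint(0, 3, (32, 3)))  # dummy batch of indices
```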
The results show that String Similarity encoding generally outperforms the commonly used Ordinal encoding, especially on multi-label classification problems; One-Hot and Rarelabel encodings also beat Ordinal encoding on several datasets. However, the improved performance comes at the cost of increased computation time, particularly for high-cardinality features.
The paper concludes by suggesting future research directions, such as investigating the impact of encoding techniques on neural networks with both continuous and discrete inputs, understanding the challenges of target encoding, and analyzing how the encoders affect the class structures captured in the entity and context embeddings.
Stats
"Discretization is conducted using a decision tree model due to the advantages it offers as mentioned in the last paragraph of section II."
"To perform discretization, the mean accuracy divided by the standard deviation of the cross-validated model is used to choose a suited α."
"The remaining nodes represent the bin edges for interval construction, where each value of the continuous variable is assigned to the corresponding bin."
Quotes
"String similarity encoding compares class names in order to form a similarity matrix. While many methods exist to compare two strings with each other[28], the Jaro-Winkler similarity[29] will be given as an example."
"Returning loss, accuracy and the prediction probabilities to form metrics as well as keeping track of the training time builds the foundation for the evaluation process."