Core Concept
Incorporating class information and feature correlation structures into the tabular data augmentation process can improve the performance of contrastive learning for downstream classification tasks.
Summary
The paper proposes two improvements to existing tabular data augmentation techniques used in contrastive learning:
Class-Conditioned Corruption:
When corrupting a feature value in the anchor row, instead of randomly sampling a replacement value from the entire table, the authors restrict the sampling to only rows that belong to the same class as the anchor row.
This ensures the generated views are more semantically similar to the anchor, as they share the same class identity.
The authors adopt a pseudo-labeling approach to obtain class estimates for the unlabeled rows during the contrastive pre-training stage.
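The class-conditioned corruption step can be sketched as follows. This is a minimal NumPy illustration, not the authors' implementation: the function name, the per-row corruption loop, and the fixed corruption rate are all assumptions, and `y` stands in for either true labels or the pseudo-labels mentioned above.

```python
import numpy as np

def class_conditioned_corrupt(X, y, corruption_rate=0.3, rng=None):
    """Corrupt each row by replacing a random subset of its features with
    values sampled only from rows that share the same (pseudo-)label.

    X : (n, d) feature matrix; y : (n,) class or pseudo-label array.
    Hypothetical sketch of the class-conditioned corruption idea.
    """
    rng = np.random.default_rng(rng)
    n, d = X.shape
    X_corrupt = X.copy()
    # Precompute row indices per class so same-class sampling is cheap.
    class_rows = {c: np.flatnonzero(y == c) for c in np.unique(y)}
    n_corrupt = max(1, int(corruption_rate * d))
    for i in range(n):
        # Choose which features of this anchor row to corrupt.
        feats = rng.choice(d, size=n_corrupt, replace=False)
        donors = class_rows[y[i]]  # only rows with the anchor's class
        for j in feats:
            X_corrupt[i, j] = X[rng.choice(donors), j]
    return X_corrupt
```

Because replacement values are drawn only from same-class rows, the corrupted view stays closer to the anchor's class-conditional feature distribution than a view corrupted from the whole table.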
Correlation-Based Feature Masking:
The authors explore selecting the subset of features to corrupt based on the feature correlation structure, instead of random selection.
They use XGBoost feature importance scores as a proxy for feature correlations, and sample the subset of highly correlated or uncorrelated features to corrupt.
The intuition is that corrupting correlated features will encourage the model to learn the underlying feature relationships, which can benefit the downstream classification task.
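A minimal sketch of the feature-selection step, assuming per-feature scores are already available (the paper uses XGBoost feature importances as the correlation proxy; here `importance` is a plain array so the sketch stays self-contained). The function name and the importance-proportional sampling scheme are illustrative assumptions.

```python
import numpy as np

def importance_weighted_mask(importance, n_corrupt, correlated=True, rng=None):
    """Sample indices of features to corrupt, biased by a per-feature
    importance score used as a correlation proxy.

    correlated=True favors high-score features; False favors low-score ones.
    Hypothetical sketch of the correlation-based masking idea.
    """
    rng = np.random.default_rng(rng)
    imp = np.asarray(importance, dtype=float)
    # Flip the weighting when targeting uncorrelated features.
    w = imp if correlated else imp.max() - imp
    w = w + 1e-8  # keep every feature reachable with nonzero probability
    p = w / w.sum()
    return rng.choice(len(imp), size=n_corrupt, replace=False, p=p)
```

The returned index set would then drive which columns get replaced in the corruption step, instead of a uniformly random subset.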
Extensive experiments on the OpenML-CC18 benchmark show that the class-conditioned corruption approach consistently outperforms the conventional random corruption method. However, the correlation-based feature masking did not yield consistent improvements, likely because the heavily preprocessed benchmark datasets retain limited feature correlation structure.
The paper highlights the importance of incorporating domain-specific knowledge, such as class information, into the data augmentation process for contrastive learning on tabular data. It also motivates further research into quantifying and leveraging feature correlation structures for effective tabular data augmentation.
Statistics
The paper reports classification accuracy and AUROC metrics on 30 datasets from the OpenML-CC18 benchmark.
Quotes
"Contrastive learning is mainly used to pre-train an encoder block, which maps the input raw data to an intermediate embedding space."
"Nonetheless, for the domain of tabular data, it is more challenging to design such effective data augmentation techniques."
"Recognizing the gap in tabular data augmentation techniques, we first propose an easy yet powerful improvement to the feature-value corruption technique, by incorporating the class information into the corruption procedure."