Core Concepts
Sort & Slice outperforms hash-based folding for ECFP substructure pooling, offering a simple yet effective feature selection strategy.
Abstract
The article introduces Sort & Slice as an alternative to hash-based folding for pooling ECFP substructures into a fixed-length vector. It compares several substructure-pooling methods and reports that Sort & Slice performs best among them. The computational evaluation spans multiple molecular property prediction tasks, data-splitting techniques, and machine-learning models.
Introduction to Extended-Connectivity Fingerprints (ECFPs) and hash-based folding.
Description of Sort & Slice as a substructure-pooling method for ECFPs.
Comparison of Sort & Slice with hash-based folding, filtering, and mutual information maximisation.
Experimental evaluation of substructure-pooling techniques for molecular property prediction.
Recommendations for the adoption of Sort & Slice in place of hash-based folding.
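For orientation, hash-based folding can be sketched in a few lines. This is a minimal illustration rather than the article's implementation: it assumes each ECFP substructure is already represented by an integer identifier, and it assumes the common realisation in which an identifier is mapped to a vector position by taking it modulo the vector length. The function name fold_fingerprint and the toy identifiers are hypothetical.

```python
def fold_fingerprint(substructure_ids, length=1024):
    """Fold integer substructure identifiers into a fixed-length binary vector.

    Assumption: folding is done with a modulo operation, so two different
    substructures can land on the same position (a bit collision).
    """
    bits = [0] * length
    for sub_id in substructure_ids:
        bits[sub_id % length] = 1
    return bits


# Toy example: identifiers 3 and 11 collide at position 3 when length is 8.
folded = fold_fingerprint({3, 11, 5}, length=8)
```

In the toy example three distinct substructures occupy only two positions, which illustrates how folding can merge unrelated substructures into a single feature.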
Stats
Sort & Slice robustly outperforms hash-based folding.
Sort & Slice offers a simple yet effective feature selection strategy.
Quotes
"Sort & Slice first sorts ECFP substructures according to their relative prevalence and then slices away infrequent substructures."
"Sort & Slice outperforms both advanced supervised substructure-selection schemes, filtering and mutual-information maximisation."
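The sort-then-slice procedure described in the first quote can be sketched as follows. This is a hedged illustration under stated assumptions, not the authors' code: prevalence is taken to be the number of training compounds that contain a substructure, ties are broken by identifier value, and each compound is given as a set of integer substructure identifiers. The function names fit_sort_and_slice and transform are hypothetical.

```python
from collections import Counter


def fit_sort_and_slice(training_sets, length):
    """Return a mapping from the `length` most prevalent substructures to vector positions."""
    counts = Counter()
    for substructures in training_sets:
        # Count each substructure at most once per compound.
        counts.update(set(substructures))
    # Sort by decreasing prevalence; tie-breaking by identifier is an assumption.
    ranked = sorted(counts, key=lambda s: (-counts[s], s))
    # Slice away everything beyond the first `length` substructures.
    return {sub_id: position for position, sub_id in enumerate(ranked[:length])}


def transform(substructure_ids, index, length):
    """Encode one compound; substructures absent from the index are ignored."""
    bits = [0] * length
    for sub_id in substructure_ids:
        position = index.get(sub_id)
        if position is not None:
            bits[position] = 1
    return bits


# Toy training set: identifier 10 occurs in three compounds, 20 in two, 30 and 40 in one.
training = [{10, 20, 30}, {10, 20}, {10, 40}]
index = fit_sort_and_slice(training, length=3)
vector = transform({10, 40, 99}, index, length=3)
```

Because every retained position corresponds to exactly one substructure, this sketch has no bit collisions. The ranking uses no property labels, only substructure counts in the training set.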