Core Concepts
The paper gives efficient streaming and massively parallel algorithms for finding the optimal split point in decision tree learning: the split that minimizes the mean squared error for regression, or the misclassification rate for classification.
Abstract
The paper presents efficient algorithms for finding the optimal split point in decision tree learning, for both regression and classification, in the streaming and massively parallel computation (MPC) models.
For regression:
A 1-pass deterministic algorithm that uses O~(D) space and time, where D is the number of distinct feature values, and finds the exactly optimal split.
A 2-pass algorithm that, with high probability, uses O~(1/ε) space and time and computes a split with mean squared error at most OPT + ε.
An O(log N)-pass algorithm that, with high probability, uses O~(1/ε^2) space and time and computes a split with mean squared error at most (1+ε)OPT.
For classification:
A 1-pass algorithm that, with high probability, uses O~(1/ε) space and time and computes a split with misclassification rate at most OPT + ε.
An O(log N)-pass algorithm that, with high probability, uses O~(1/ε^2) space and time and computes a split with misclassification rate at most (1+ε)OPT.
For classification with categorical features, the paper also gives a 1-pass algorithm that uses O~(N/ε) space and time to find a partition of the feature values achieving misclassification rate at most OPT + ε.
All of the algorithms operate in the streaming and massively parallel computation settings, making them suitable for large-scale data.
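To make the O~(D)-space regression setting concrete, here is a minimal one-pass sketch (not necessarily the paper's exact algorithm): aggregate per-value sufficient statistics (count, sum of labels, sum of squared labels) while streaming, then scan the D distinct values once to find the threshold minimizing the mean squared error.

```python
from collections import defaultdict

def best_regression_split(stream):
    """One-pass sketch: per-value sufficient statistics, then a scan
    over candidate thresholds. Illustrative O~(D)-space approach only."""
    stats = defaultdict(lambda: [0, 0.0, 0.0])  # x -> [count, sum y, sum y^2]
    for x, y in stream:
        s = stats[x]
        s[0] += 1
        s[1] += y
        s[2] += y * y

    values = sorted(stats)
    n_tot = sum(s[0] for s in stats.values())
    sum_tot = sum(s[1] for s in stats.values())
    sq_tot = sum(s[2] for s in stats.values())

    best_t, best_sse = None, float("inf")
    n_l, sum_l = 0, 0.0
    # Consider split "x <= t" for each distinct value t except the last.
    for t in values[:-1]:
        c, sy, _ = stats[t]
        n_l += c
        sum_l += sy
        n_r, sum_r = n_tot - n_l, sum_tot - sum_l
        # Sum of squared errors when each side predicts its mean:
        # SSE = sum y^2 - sum_l^2/n_l - sum_r^2/n_r.
        sse = sq_tot - sum_l**2 / n_l - sum_r**2 / n_r
        if sse < best_sse:
            best_sse, best_t = sse, t
    return best_t, best_sse / n_tot  # threshold, resulting MSE
```

The dictionary holds one entry per distinct feature value, so space is O(D) regardless of the stream length m; the sublinear O~(1/ε) algorithms in the paper avoid even this by settling for an approximate split.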
Stats
The data consist of m observations x1, x2, ..., xm ∈ [N] together with labels y1, y2, ..., ym ∈ [0, M] for regression, or y1, y2, ..., ym ∈ {-1, +1} for classification.
D ≤ N is the number of distinct feature values.
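Under this data model, the objective OPT for classification is simple to state: for a threshold split "x ≤ t", each side predicts its majority label, and the misclassification rate counts the minority labels on each side. A small sketch (the function name and in-memory evaluation are illustrative, not from the paper):

```python
def misclassification_rate(data, t):
    """Misclassification rate of the split "x <= t" on labeled data
    [(x, y), ...] with y in {-1, +1}, when each side predicts its
    majority label. Illustrative in-memory computation only."""
    left = [y for x, y in data if x <= t]
    right = [y for x, y in data if x > t]
    # Minority count on each side = errors of the majority-label predictor.
    errors = (min(left.count(1), left.count(-1))
              + min(right.count(1), right.count(-1)))
    return errors / len(data)

data = [(1, -1), (2, -1), (3, 1), (4, 1)]
print(misclassification_rate(data, 2))  # perfect split: rate 0.0
print(misclassification_rate(data, 1))  # one minority label on the right: 0.25
```

OPT is the minimum of this quantity over all thresholds t; the streaming algorithms above approximate it to within OPT + ε or (1+ε)OPT without storing the data.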