Core Concepts
The framework is a self-organizing clustering system that efficiently monitors high-dimensional data streams and detects distribution changes without assuming any particular form for the input data distribution.
Abstract
The proposed framework addresses the problem of distribution shift detection in high-dimensional data streams. The key highlights are:
It develops a bio-inspired self-organizing clustering system to assess distribution changes in data streams, which can be applied in unsupervised contexts.
It investigates the use of Self-Organizing Maps (SOMs) and Scale-Invariant Maps (SIMs) for dimensionality reduction, exploring the statistical aspects of the latent space.
By construction, the framework generates a univariate signal that can be reasonably assumed to be Gaussian, enabling efficient computation of the Kullback-Leibler divergence to quantify distribution changes.
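For reference, the Kullback-Leibler divergence between two univariate Gaussians has a standard closed form, which is why the Gaussian assumption makes the computation cheap. The symbols below (P and Q for the two distributions, with means \mu_1, \mu_2 and variances \sigma_1^2, \sigma_2^2) are notation introduced here for illustration and do not come from the source.

```latex
\[
D_{\mathrm{KL}}(P \,\|\, Q)
  = \log\frac{\sigma_2}{\sigma_1}
  + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\sigma_2^2}
  - \frac{1}{2},
\qquad
P = \mathcal{N}(\mu_1, \sigma_1^2),\quad
Q = \mathcal{N}(\mu_2, \sigma_2^2).
\]
```

Only the sample mean and variance of each segment of the signal are needed, so no density estimation is required.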
The framework first applies a non-linear, topology-preserving projection (SOM or SIM) to the high-dimensional input data. It then builds a distance matrix that captures the geometric relationships between the projected points and the representative neurons. This distance matrix is in turn summarized statistically to yield the univariate signal; this work focuses on the mean.
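The projection, distance matrix, and mean summary described above can be sketched roughly as follows. This is a minimal illustration under several assumptions, not the authors' implementation: it uses a plain rectangular SOM written in NumPy (no SIM variant), Euclidean distances from each input sample to each neuron's weight vector, and hypothetical names and hyperparameters (`train_som`, `mean_distance_signal`, `grid`, `epochs`, `lr0`, `sigma0`).

```python
import numpy as np

def train_som(data, grid=(5, 5), epochs=20, lr0=0.5, sigma0=2.0, seed=0):
    """Fit a small rectangular SOM; returns weights of shape (n_neurons, n_features)."""
    rng = np.random.default_rng(seed)
    rows, cols = grid
    # Grid coordinates of each neuron, used by the neighbourhood function.
    coords = np.array([(r, c) for r in range(rows) for c in range(cols)], dtype=float)
    # Initialise neurons from randomly chosen training samples.
    weights = data[rng.choice(len(data), rows * cols, replace=True)].astype(float)
    n_steps = epochs * len(data)
    step = 0
    for _ in range(epochs):
        for x in data[rng.permutation(len(data))]:
            frac = step / n_steps
            lr = lr0 * (1.0 - frac)                 # linearly decaying learning rate
            sigma = sigma0 * (1.0 - frac) + 1e-3    # linearly shrinking neighbourhood
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(weights - x, axis=1))
            # Gaussian neighbourhood on the grid, centred on the BMU.
            grid_d2 = np.sum((coords - coords[bmu]) ** 2, axis=1)
            h = np.exp(-grid_d2 / (2.0 * sigma ** 2))
            weights += lr * h[:, None] * (x - weights)
            step += 1
    return weights

def mean_distance_signal(data, weights):
    """Reduce each sample to the mean of its distances to all neurons."""
    # Distance matrix: entry (i, j) is the Euclidean distance from sample i to neuron j.
    dist = np.linalg.norm(data[:, None, :] - weights[None, :, :], axis=2)
    # Row-wise mean gives one scalar per sample, i.e. a univariate signal.
    return dist.mean(axis=1)
```

In this sketch the map is trained once on reference data, and `mean_distance_signal` is then applied to incoming samples, so a change in the input distribution shows up as a change in the scalar signal.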
The resulting univariate signal is monitored for distribution changes using the Kullback-Leibler divergence between consecutive chunks of data. A simple decision rule based on outlier detection is used to identify significant distribution shifts.
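A minimal sketch of this monitoring step is given below, assuming each chunk of the signal is summarized by its sample mean and variance and compared with the next chunk through the closed-form Gaussian KL divergence. The source does not specify the outlier rule, so the median-plus-scaled-MAD threshold used here is a stand-in, and the direction of the divergence (previous chunk relative to next chunk) is likewise an assumption. All function and parameter names (`gaussian_kl`, `chunk_kl`, `flag_shifts`, `chunk_size`, `k`) are hypothetical.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(P || Q) for univariate Gaussians P and Q."""
    return 0.5 * (np.log(var_q / var_p) + (var_p + (mu_p - mu_q) ** 2) / var_q - 1.0)

def chunk_kl(signal, chunk_size, eps=1e-12):
    """KL divergence between each pair of consecutive, non-overlapping chunks."""
    n_chunks = len(signal) // chunk_size
    # Drop the incomplete tail and arrange the signal as (n_chunks, chunk_size).
    chunks = np.asarray(signal[: n_chunks * chunk_size], dtype=float)
    chunks = chunks.reshape(n_chunks, chunk_size)
    mu = chunks.mean(axis=1)
    var = chunks.var(axis=1) + eps  # eps guards against zero variance
    # Entry i compares chunk i (as P) with chunk i + 1 (as Q).
    return gaussian_kl(mu[:-1], var[:-1], mu[1:], var[1:])

def flag_shifts(kl_values, k=3.0):
    """Assumed outlier rule: flag values above median + k * scaled MAD."""
    kl_values = np.asarray(kl_values, dtype=float)
    med = np.median(kl_values)
    mad = np.median(np.abs(kl_values - med))
    threshold = med + k * 1.4826 * mad
    return np.flatnonzero(kl_values > threshold)
```

An index `i` returned by `flag_shifts` points at the boundary between chunk `i` and chunk `i + 1`. The chunk size trades detection delay against the stability of the mean and variance estimates.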
The empirical evaluation on synthetic and real-world datasets, including MNIST with adversarial samples, gas sensor measurements, and ozone data, demonstrates the potential of the proposed approach in efficiently detecting distribution shifts in high-dimensional data streams.
Stats
The mean function of the distance matrix captures sufficient information to distinguish between the original MNIST samples and the adversarial MNIST samples.