The proposed framework addresses the problem of distribution shift detection in high-dimensional data streams. The key highlights are:
It develops a bio-inspired self-organizing clustering system to assess distribution changes in data streams, which can be applied in unsupervised contexts.
It investigates the use of Self-Organizing Maps (SOMs) and Scale-Invariant Maps (SIMs) for dimensionality reduction, exploring the statistical aspects of the latent space.
By construction, the framework generates a univariate signal that can be reasonably assumed to be Gaussian, enabling efficient computation of the Kullback-Leibler divergence to quantify distribution changes.
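For reference, the Gaussian assumption matters because the Kullback-Leibler divergence between two univariate Gaussians has a closed form in their means and variances. The formula below is the standard textbook result; the symbols P, Q, \mu and \sigma are notation chosen here, and the summary does not say which direction of the divergence the authors use.

```latex
% P = N(\mu_1, \sigma_1^2), Q = N(\mu_2, \sigma_2^2); notation assumed for illustration
D_{\mathrm{KL}}(P \,\|\, Q)
  = \ln\frac{\sigma_2}{\sigma_1}
  + \frac{\sigma_1^2 + (\mu_1 - \mu_2)^2}{2\,\sigma_2^2}
  - \frac{1}{2}
```

Only two sample statistics per chunk are needed, so the cost of each comparison does not depend on the dimensionality of the original data.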
The framework first applies a non-linear projection of the high-dimensional input data using a topology-preserving mapping (SOM or SIM). It then creates a distance matrix capturing the geometric relationships between the projected points and the representative neurons. This distance matrix is further embedded using statistical summaries, with the mean function being the focus in this work.
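The projection and embedding steps can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the tiny one-dimensional SOM, the function names `train_som` and `mean_distance_signal`, and all hyperparameters are assumptions made here, and the SIM variant is not covered. The sketch also assumes that the mean is taken over the neurons for each input point, which the summary does not state explicitly.

```python
import numpy as np

def train_som(X, n_neurons=16, n_epochs=5, lr=0.5, sigma=2.0, seed=0):
    """Train a tiny 1-D SOM (illustrative only).

    Returns the neuron prototypes, shape (n_neurons, n_features).
    Requires len(X) >= n_neurons.
    """
    rng = np.random.default_rng(seed)
    # Initialise prototypes from randomly chosen input samples.
    W = X[rng.choice(len(X), n_neurons, replace=False)].astype(float)
    grid = np.arange(n_neurons)  # neuron positions on the 1-D map
    n_steps = n_epochs * len(X)
    step = 0
    for _ in range(n_epochs):
        for x in X[rng.permutation(len(X))]:
            frac = 1.0 - step / n_steps  # linear decay of rate and radius
            # Best-matching unit: the neuron closest to the sample.
            bmu = np.argmin(np.linalg.norm(W - x, axis=1))
            # Gaussian neighbourhood around the BMU on the map.
            h = np.exp(-((grid - bmu) ** 2) / (2 * (sigma * frac + 1e-3) ** 2))
            W += (lr * frac) * h[:, None] * (x - W)
            step += 1
    return W

def mean_distance_signal(X, W):
    """Distance matrix between inputs and neurons, reduced by the mean.

    D has shape (n_samples, n_neurons); the result has shape (n_samples,).
    """
    D = np.linalg.norm(X[:, None, :] - W[None, :, :], axis=2)
    return D.mean(axis=1)
```

Under these assumptions, calling `mean_distance_signal(X_new, W)` on incoming data yields one scalar per sample, and that scalar stream is the univariate signal that gets monitored.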
The resulting univariate signal is monitored for distribution changes using the Kullback-Leibler divergence between consecutive chunks of data. A simple decision rule based on outlier detection is used to identify significant distribution shifts.
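The monitoring step might look like the sketch below. It fits a Gaussian to each chunk, computes the closed-form Kullback-Leibler divergence between consecutive chunks, and flags unusually large values. The summary does not specify the outlier rule, so the Tukey-style upper fence used in `flag_shifts` is an assumption, as are the function names, the chunking scheme, and the direction of the divergence.

```python
import numpy as np

def gaussian_kl(mu_p, var_p, mu_q, var_q):
    """Closed-form KL(P || Q) for univariate Gaussians P and Q."""
    return 0.5 * (np.log(var_q / var_p)
                  + (var_p + (mu_p - mu_q) ** 2) / var_q
                  - 1.0)

def chunk_kl_scores(signal, chunk_size):
    """KL divergence between each pair of consecutive chunks.

    Trailing samples that do not fill a whole chunk are dropped.
    Returns an array of length n_chunks - 1.
    """
    signal = np.asarray(signal, dtype=float)
    n_chunks = len(signal) // chunk_size
    chunks = signal[: n_chunks * chunk_size].reshape(n_chunks, chunk_size)
    mu = chunks.mean(axis=1)
    var = chunks.var(axis=1, ddof=1) + 1e-12  # guard against zero variance
    return gaussian_kl(mu[:-1], var[:-1], mu[1:], var[1:])

def flag_shifts(scores, k=1.5):
    """Assumed decision rule: flag scores above the upper IQR fence."""
    q1, q3 = np.percentile(scores, [25, 75])
    return scores > q3 + k * (q3 - q1)
```

A flag at index i marks a suspected shift between chunk i and chunk i + 1. The fence multiplier `k` trades sensitivity against false alarms and would need tuning per data stream.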
The empirical evaluation on synthetic and real-world datasets, including MNIST with adversarial samples, gas sensor measurements, and ozone data, demonstrates the potential of the proposed approach in efficiently detecting distribution shifts in high-dimensional data streams.