Core Concepts
Principal component analysis (PCA) has a significant denoising effect on high-dimensional data with underlying community structure, which can be quantified through a novel metric called compression ratio.
Abstract
The paper proposes a novel metric called compression ratio to capture the denoising effect of PCA on high-dimensional noisy data with underlying community structure. The key insights are:
For data with community structure, PCA substantially reduces the distance between data points in the same community, while reducing distances between points in different communities only mildly. This is demonstrated through both theoretical proofs and experiments on real-world data.
The compression ratio of a pair of data points, defined as the ratio of their pre-PCA distance to their post-PCA distance, quantifies this denoising effect. Intra-community pairs have higher compression ratios than inter-community pairs.
Building on the compression ratio metric, the paper proposes a simple outlier detection algorithm that removes points whose compression ratios have low variance, on the grounds that such points do not share a common signal with the others. Simulations show this algorithm to be competitive with popular outlier detection methods.
Experiments on real-world single-cell RNA-seq datasets demonstrate that the compression ratio metric captures the denoising effect of PCA, and removing outliers identified by the variance of compression ratio method improves the accuracy of clustering algorithms.
Overall, the paper introduces a novel geometric perspective to understand the denoising power of PCA, and shows its practical utility in improving the analysis of high-dimensional noisy data.
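The two ideas above, the pairwise compression ratio and the variance-based outlier score, can be illustrated with a minimal NumPy sketch. This is not the paper's implementation. The function names, the mean-centering step, the toy two-community data, the choice of k = 2 components, and the number of points flagged are all assumptions made for illustration.

```python
import numpy as np

def pca_project(X, k):
    """Project the rows of X onto the top-k principal components (via SVD).
    Mean-centering is an assumption of this sketch."""
    Xc = X - X.mean(axis=0)
    _, _, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:k].T

def pairwise_distances(X):
    """Euclidean distance between every pair of rows (fine for small n)."""
    diff = X[:, None, :] - X[None, :, :]
    return np.linalg.norm(diff, axis=-1)

def compression_ratios(X, k):
    """Pre-PCA distance divided by post-PCA distance for every pair.
    Entries with zero post-PCA distance (including the diagonal) are NaN."""
    pre = pairwise_distances(X)
    post = pairwise_distances(pca_project(X, k))
    ratios = np.full(pre.shape, np.nan)
    np.divide(pre, post, out=ratios, where=post > 0)
    return ratios

def lowest_variance_points(ratios, n_flag):
    """Indices of the n_flag points whose compression ratios (against all
    other points) have the lowest variance: the outlier candidates."""
    variances = np.nanvar(ratios, axis=1)
    return np.argsort(variances)[:n_flag]

# Toy data (hypothetical): two communities in 100 dimensions with unit noise.
rng = np.random.default_rng(0)
n_per, d = 40, 100
centers = np.zeros((2, d))
centers[1, :4] = 6.0                      # centers are 12 apart
labels = np.repeat([0, 1], n_per)
X = centers[labels] + rng.normal(size=(2 * n_per, d))

ratios = compression_ratios(X, k=2)
same = labels[:, None] == labels[None, :]
print("median intra-community ratio:", np.nanmedian(ratios[same]))
print("median inter-community ratio:", np.nanmedian(ratios[~same]))
print("lowest-variance points:", lowest_variance_points(ratios, n_flag=5))
```

Because the projection is orthogonal, post-PCA distances never exceed pre-PCA distances, so every ratio is at least 1. On this toy data the intra-community median should come out larger than the inter-community median. The toy data contains no planted outliers, so the final line only shows how candidates would be ranked.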
Stats
The paper does not provide any specific numerical data or statistics. It focuses on theoretical analysis and experimental validation of the compression ratio metric and the outlier detection method.