The paper presents panCAKES, a novel hierarchical compression algorithm that enables efficient, exact k-NN and ρ-NN search on compressed data. panCAKES assumes the manifold hypothesis and leverages the low-dimensional structure of the data to compress and search it efficiently.
The key highlights and insights are:
panCAKES uses a divisive hierarchical clustering algorithm to build a cluster tree. The tree structure allows for both efficient compression and search on the compressed dataset.
panCAKES supports compression of data under any distance function where the distance between two points is proportional to the memory cost of storing an encoding of one in terms of the other. This property holds for many widely-used distance functions, such as string edit distances and set dissimilarity measures.
panCAKES achieves compression ratios close to those of gzip, while offering sub-linear time performance for k-NN and ρ-NN search on the compressed data.
The authors provide a theoretical analysis of how cluster radii scale, showing that radii are guaranteed to decrease by a multiplicative factor of √2/2 after at most d partitions, where d is the fractal dimensionality of the dataset.
The authors benchmark panCAKES on a variety of datasets, including genomic, proteomic, and set data, and compare the compression ratios and search performance between the compressed and uncompressed versions.
The authors provide an open-source implementation of panCAKES in the Rust programming language.
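The encoding property described above (the cost of storing one point in terms of another grows with the distance between them) can be illustrated with a minimal sketch. This is not the panCAKES implementation or its API. It assumes equal-length byte strings and uses Hamming distance as a simple stand-in for the edit distances mentioned in the summary, and the names `Edit`, `hamming`, `encode`, and `decode` are hypothetical.

```rust
/// One substitution: at `pos`, the target holds `byte` instead of the
/// reference's byte. (Hypothetical type, for illustration only.)
#[derive(Debug, Clone, PartialEq)]
struct Edit {
    pos: usize,
    byte: u8,
}

/// Hamming distance between two equal-length byte strings.
/// Assumption: used here as a simplified stand-in for an edit distance.
fn hamming(a: &[u8], b: &[u8]) -> usize {
    assert_eq!(a.len(), b.len(), "sketch assumes equal-length inputs");
    a.iter().zip(b).filter(|(x, y)| x != y).count()
}

/// Encode `target` as the substitutions that turn `reference` into it.
/// The number of stored edits equals the Hamming distance, so points
/// close to the reference (e.g. a cluster center) are cheap to store.
fn encode(reference: &[u8], target: &[u8]) -> Vec<Edit> {
    assert_eq!(reference.len(), target.len(), "sketch assumes equal-length inputs");
    reference
        .iter()
        .zip(target)
        .enumerate()
        .filter(|(_, (r, t))| r != t)
        .map(|(pos, (_, &byte))| Edit { pos, byte })
        .collect()
}

/// Rebuild the target exactly by applying the stored edits to the reference.
fn decode(reference: &[u8], edits: &[Edit]) -> Vec<u8> {
    let mut out = reference.to_vec();
    for e in edits {
        out[e.pos] = e.byte;
    }
    out
}
```

For example, encoding `b"ACGTTCGTACGA"` against the reference `b"ACGTACGTACGT"` stores only the positions where the two differ, and `decode` recovers the original bytes exactly. In a cluster tree, the natural reference for a point would be a nearby cluster center, which is why tighter clusters should compress better under this kind of scheme.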
Key insights extracted from the paper by Morgan E. Pr... at arxiv.org, 09-19-2024
https://arxiv.org/pdf/2409.12161.pdf