Core Concepts
The paper presents novel applications of the large sieve inequality from analytic number theory to obtain improved algorithms for various sparse pattern matching problems, including Sparse Nonnegative Convolution, Sparse General Convolution, Text-to-Pattern Hamming Distances, and the Constellation problem.
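For orientation, one standard textbook form of the large sieve inequality is displayed below. The symbols K, R, M, and a_n are local to this display and are not taken from the paper, which may rely on a different variant of the inequality.

```latex
\[
  \sum_{q \le R} \; \sum_{\substack{1 \le a \le q \\ \gcd(a,q) = 1}}
    \left| S\!\left(\tfrac{a}{q}\right) \right|^{2}
  \;\le\; \left( K + R^{2} \right) \sum_{n = M+1}^{M+K} |a_n|^{2},
  \qquad
  S(\alpha) = \sum_{n = M+1}^{M+K} a_n \, e^{2\pi i n \alpha}.
\]
```

Here a_{M+1}, ..., a_{M+K} are arbitrary complex numbers, and the outer sum ranges over all moduli q up to R.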
Abstract
The paper studies several sparse pattern matching problems, including Sparse Convolution, Text-to-Pattern Hamming Distances, and the Constellation problem. Many of these problems can be reduced to dense instances using the mod-prime hash function h(x) = x mod p, where p is a random prime on the order of Q. This hash function has two main drawbacks: (1) its collision probability is O((log N)/Q) rather than the optimal O(1/Q), and (2) the choice of the prime p is difficult to derandomize.
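The reduction from sparse to dense instances can be illustrated with a minimal sketch. It assumes the prime is drawn uniformly from [Q, 2Q] and uses quadratic-time loops where a real implementation would use FFT. All function names here are hypothetical and are not taken from the paper. The point is that hashing indices mod p commutes with convolution, so a dense cyclic convolution of length p gives the hashed image of A ⋆ B.

```python
import random


def is_prime(n: int) -> bool:
    """Trial division; adequate for the small moduli used in this sketch."""
    if n < 2:
        return False
    i = 2
    while i * i <= n:
        if n % i == 0:
            return False
        i += 1
    return True


def random_prime(q: int, rng: random.Random) -> int:
    """Assumption: sample a uniformly random prime from [q, 2q] by rejection."""
    while True:
        p = rng.randint(q, 2 * q)
        if is_prime(p):
            return p


def sparse_convolution(a: dict[int, int], b: dict[int, int]) -> dict[int, int]:
    """Brute-force reference: (A * B)[k] = sum of A[i] * B[j] over i + j = k."""
    out: dict[int, int] = {}
    for i, x in a.items():
        for j, y in b.items():
            out[i + j] = out.get(i + j, 0) + x * y
    return {k: v for k, v in out.items() if v != 0}


def hash_vector(v: dict[int, int], p: int) -> list[int]:
    """Fold a sparse vector into a dense length-p vector via h(x) = x mod p."""
    dense = [0] * p
    for i, x in v.items():
        dense[i % p] += x
    return dense


def cyclic_convolution(x: list[int], y: list[int]) -> list[int]:
    """Dense cyclic convolution mod len(x); quadratic here, FFT in practice."""
    p = len(x)
    out = [0] * p
    for i, xi in enumerate(x):
        if xi == 0:
            continue
        for j, yj in enumerate(y):
            out[(i + j) % p] += xi * yj
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    A = {3: 2, 1_000_003: 5, 70_000_000: 1}
    B = {0: 1, 12_345: 4}
    p = random_prime(64, rng)
    # Convolving the hashed inputs equals hashing the true output,
    # because (i + j) mod p == ((i mod p) + (j mod p)) mod p.
    assert cyclic_convolution(hash_vector(A, p), hash_vector(B, p)) == \
        hash_vector(sparse_convolution(A, B), p)
    print("prime:", p)
```

Two distinct output indices collide under h exactly when p divides their difference, which is why the choice of p governs the collision probability discussed above.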
The main technical contribution of the paper is the use of the large sieve inequality from analytic number theory to partially overcome these drawbacks in certain scenarios. Specifically:
- Sparse Nonnegative Convolution:
  - The paper obtains a Las Vegas algorithm that computes the convolution A ⋆ B of two nonnegative integer vectors A and B, running in O(t log t) time with probability at least 1 - 1/poly(t), where t is the output sparsity (the number of nonzero entries of A ⋆ B).
  - This simultaneously improves on the previous O(t log t log log t)-time Las Vegas algorithm and the O(t log t)-time Monte Carlo algorithm with failure probability 2^{-sqrt(log t)}.
- Sparse General Convolution:
  - When the length N of the input vectors satisfies N ≤ t^{1.99}, the paper gives a Monte Carlo O(t log t)-time algorithm for sparse convolution with possibly negative input entries.
  - This partially resolves an open question from previous work: whether Sparse General Convolution can be solved in O(t log t + polylog(N∆)) time.
- Text-to-Pattern Hamming Distances:
  - The paper obtains a deterministic O(n√(m log log m))-time algorithm that exactly computes the Hamming distance between a length-m pattern P and every length-m substring of a length-n text T.
  - This improves on the previous O(n√m (log m log log m)^{1/4})-time deterministic algorithm and nearly matches the O(n√m)-time Las Vegas algorithm from the same previous work.
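To make the Text-to-Pattern Hamming Distances problem statement concrete, here is a brute-force reference sketch together with a per-symbol correlation variant that shows why the problem is related to convolution. Neither is the paper's algorithm. Both take quadratic time in the worst case, and the function names are hypothetical.

```python
def hamming_naive(text: str, pattern: str) -> list[int]:
    """O(n * m) reference: distance of the pattern to each length-m window."""
    n, m = len(text), len(pattern)
    if m > n:
        return []
    return [
        sum(1 for a, b in zip(text[s:s + m], pattern) if a != b)
        for s in range(n - m + 1)
    ]


def hamming_via_correlation(text: str, pattern: str) -> list[int]:
    """Count matches per symbol, then subtract from m.

    For each symbol c, every pair (i, j) with text[i] == pattern[j] == c
    contributes one match to the alignment at shift i - j. Summing over all
    symbols is a sum of sparse correlations, which fast algorithms replace
    with convolution routines.
    """
    n, m = len(text), len(pattern)
    if m > n:
        return []
    matches = [0] * (n - m + 1)
    for c in set(pattern):
        text_pos = [i for i, ch in enumerate(text) if ch == c]
        pat_pos = [j for j, ch in enumerate(pattern) if ch == c]
        for i in text_pos:
            for j in pat_pos:
                if 0 <= i - j <= n - m:
                    matches[i - j] += 1
    return [m - x for x in matches]


if __name__ == "__main__":
    T, P = "abracadabra", "abr"
    assert hamming_naive(T, P) == hamming_via_correlation(T, P)
    print(hamming_naive(T, P))
```

The output has n - m + 1 entries, one per alignment of P against T, and an entry of 0 marks an exact occurrence of the pattern.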
The key technical component behind the Text-to-Pattern Hamming Distances result is a variant of the "X + Y lemma" that can be computed deterministically in O(N log(s^2/N) + N log log N) time, where s is the sum of the 1-norms of the input vectors.