Core Concepts
Grafite is a novel range filter that provides robust and predictable false positive rates across all datasets and query workloads.
Abstract
Existing range filters struggle with adversarial queries, and Grafite is proposed as a solution. The paper explains Grafite's design, states its theoretical guarantees, and evaluates it experimentally. It also introduces a heuristic range filter named Bucketing for comparison.
Introduction to Range Filters
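Range filters are compact, approximate data structures that test whether a query range intersects a set of keys; they may report false positives but never false negatives.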
They are crucial for various applications like networking, databases, and search engines.
Existing Challenges
Practical range filters face high false positive rates with adversarial inputs.
Queries that are correlated with the keys, such as ranges that fall close to stored keys, are especially hard to filter.
Introduction of Grafite
Grafite offers clear guarantees regardless of input data and query distribution.
It offers faster queries and construction than competing filters, together with robust false positive rates.
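To make the idea concrete, here is a minimal sketch of a Grafite-style filter, not the authors' implementation. It assumes non-negative integer keys, a reduced universe of size r = n * 2^(B-2) for a budget of B bits per key, and a locality-preserving hash of the form h(x) = (q(floor(x/r)) + x) mod r. The class name `GrafiteSketch`, the linear stand-in for q, and the choice to split queries at block boundaries are all illustrative assumptions. A plain sorted list with binary search stands in for the Elias-Fano encoding, so the sketch does not meet the space budget it is parameterised by.

```python
import bisect
import random


class GrafiteSketch:
    """Toy range filter: locality-preserving hash plus sorted hash codes."""

    _PRIME = (1 << 61) - 1  # modulus for the stand-in block hash q

    def __init__(self, keys, bits_per_key, seed=0):
        keys = list(keys)
        if not keys or bits_per_key < 2:
            raise ValueError("need at least one key and bits_per_key >= 2")
        # Assumed reduced universe size: r = n * 2^(B - 2).
        self.r = len(keys) * 2 ** (bits_per_key - 2)
        rng = random.Random(seed)
        self.c1 = rng.randrange(1, self._PRIME)
        self.c2 = rng.randrange(0, self._PRIME)
        # Sorted, de-duplicated hash codes (Elias-Fano in a real filter).
        self.codes = sorted({self._h(k) for k in keys})

    def _q(self, block):
        # Stand-in for a pairwise-independent hash of the block index.
        return ((self.c1 * block + self.c2) % self._PRIME) % self.r

    def _h(self, x):
        # Keys in the same block of r values are only shifted cyclically,
        # so nearby keys keep their relative order modulo r.
        return (self._q(x // self.r) + x) % self.r

    def _probe(self, a, b):
        # a and b lie in the same block, so [a, b] maps to one cyclic interval.
        ha, hb = self._h(a), self._h(b)
        if ha <= hb:
            i = bisect.bisect_left(self.codes, ha)
            return i < len(self.codes) and self.codes[i] <= hb
        # The interval wraps around r: check [ha, r) and [0, hb].
        return self.codes[-1] >= ha or self.codes[0] <= hb

    def may_intersect(self, a, b):
        """False means [a, b] surely holds no key; True means it may."""
        if b - a + 1 >= self.r:
            return True  # range too long for the reduced universe
        if a // self.r == b // self.r:
            return self._probe(a, b)
        boundary = (b // self.r) * self.r  # split at the block boundary
        return self._probe(a, boundary - 1) or self._probe(boundary, b)


f = GrafiteSketch([10, 500, 12345, 10**9], bits_per_key=12)
print(f.may_intersect(495, 505))  # True: the range contains the key 500
```

Because every key's hash code is stored and each sub-range maps to an interval that contains the codes of its keys, the sketch never returns a false negative; false positives come only from hash collisions in the reduced universe.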
Comparison with Heuristic Filter (Bucketing)
Bucketing is a simple heuristic filter effective on uncorrelated queries.
It demonstrates that a simpler solution can match or surpass more complex heuristic designs.
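A bucketing heuristic of this kind can be sketched in a few lines. The version below is an illustration under assumptions, not the paper's code: it splits the key universe into fixed-size buckets and remembers which buckets are non-empty, using a sorted Python list where a real filter would use a compressed representation. The names `BucketingSketch` and `bucket_size` are hypothetical.

```python
import bisect


class BucketingSketch:
    """Toy heuristic range filter: remember which fixed-size buckets hold keys."""

    def __init__(self, keys, bucket_size):
        self.s = bucket_size
        # Sorted ids of the non-empty buckets.
        self.buckets = sorted({k // bucket_size for k in keys})

    def may_intersect(self, a, b):
        lo, hi = a // self.s, b // self.s
        # Is there a non-empty bucket with id in [lo, hi]?
        i = bisect.bisect_left(self.buckets, lo)
        return i < len(self.buckets) and self.buckets[i] <= hi


f = BucketingSketch([100, 5000], bucket_size=64)
print(f.may_intersect(200, 300))  # False: buckets 3 and 4 are empty
print(f.may_intersect(101, 110))  # True, although no key lies in [101, 110]
```

The second query shows why such a heuristic suits uncorrelated workloads only: a range that sits right next to a key lands in the same bucket and is reported as a (false) positive.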
Theoretical Analysis
Theoretical comparisons show Grafite's superiority over existing solutions.
Grafite's space-time performance aligns with optimal bounds for range filters.
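The space/accuracy trade-off can be made concrete with a little arithmetic on the bound reported for Grafite: with B bits per key, the false positive probability for a query range of size ℓ is at most ℓ / 2^(B-2). The helper names below are hypothetical and only evaluate that formula.

```python
import math


def fpp_upper_bound(range_size, bits_per_key):
    """Upper bound l / 2^(B - 2) on the false positive probability, capped at 1."""
    return min(1.0, range_size / 2 ** (bits_per_key - 2))


def bits_needed(range_size, target_fpp):
    """Smallest integer B for which l / 2^(B - 2) <= target_fpp."""
    return 2 + math.ceil(math.log2(range_size / target_fpp))


print(fpp_upper_bound(32, 12))  # 0.03125, i.e. 32 / 2^10
print(bits_needed(1, 0.01))     # 9, since 1 / 2^7 <= 0.01 < 1 / 2^6
```

Each extra bit per key halves the bound, while doubling the query range size doubles it.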
Experimental Evaluation
Extensive experiments demonstrate Grafite's efficiency across datasets and query workloads.
Grafite outperforms competitors in handling correlated query workloads.
Future Directions
Supporting in-place insertions remains an open problem.
Stats
Given a fixed space budget of $B$ bits per key, the false positive probability is upper bounded by $\ell / 2^{B-2}$, where $\ell$ is the query range size.
The predecessor operation in Elias-Fano encoding allows efficient search within hash codes.
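The following sketch illustrates how a predecessor query over an Elias-Fano encoded sorted sequence can work, and how it answers a range emptiness check. It is a simplified teaching version under assumptions: the values are distinct, sorted, non-negative integers below `universe`, and plain Python lists of bit positions replace the succinct select structures a real implementation would use. All class and method names are hypothetical.

```python
class EliasFanoSketch:
    """Toy Elias-Fano encoding of a sorted list of distinct integers."""

    def __init__(self, values, universe):
        n = len(values)  # assumed non-empty and sorted
        self.universe = universe
        # Each value is split into l low bits and the remaining high bits.
        self.l = max(0, (universe // n).bit_length() - 1)
        mask = (1 << self.l) - 1
        self.lows = [v & mask for v in values]
        # High parts in unary: value i sets bit (high_i + i).
        upper = [0] * (n + (universe >> self.l) + 1)
        for i, v in enumerate(values):
            upper[(v >> self.l) + i] = 1
        # Explicit position lists replace succinct select0/select1.
        self.one_pos = [p for p, bit in enumerate(upper) if bit]
        self.zero_pos = [p for p, bit in enumerate(upper) if not bit]

    def access(self, i):
        """Return the i-th smallest stored value."""
        high = self.one_pos[i] - i
        return (high << self.l) | self.lows[i]

    def predecessor(self, x):
        """Largest stored value <= x, or None if there is none."""
        if x < 0:
            return None
        x = min(x, self.universe - 1)
        hx = x >> self.l
        # Values whose high part equals hx occupy indices [start, end).
        start = self.zero_pos[hx - 1] - (hx - 1) if hx > 0 else 0
        end = self.zero_pos[hx] - hx
        for i in range(end - 1, start - 1, -1):
            v = self.access(i)
            if v <= x:
                return v
        # Otherwise the answer is the last value with a smaller high part.
        return self.access(start - 1) if start > 0 else None

    def range_nonempty(self, a, b):
        """True if some stored value lies in [a, b]."""
        p = self.predecessor(b)
        return p is not None and p >= a


ef = EliasFanoSketch([3, 4, 7, 13, 14, 15, 21, 43], universe=64)
print(ef.predecessor(12))         # 7
print(ef.range_nonempty(16, 20))  # False: nothing stored between 15 and 21
```

In a range filter the stored values would be hash codes, so one predecessor query on the upper endpoint of the hashed range is enough to decide whether any code falls inside it.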
Quotes
"No current design can handle adversarial workloads practically."
"Grafite offers clear guarantees that hold regardless of the input data and query distributions."