Core Concept
Widespread inconsistencies and flaws in benchmarking practices for graph processing systems lead to misleading and non-reproducible results.
Summary
The authors conducted a 12-year literature review of graph processing benchmarking practices and identified significant issues:
Lack of standardization in benchmarking practices, with a wide variety of datasets, benchmarks, and metrics used across the literature.
Inconsistent use of datasets, with different papers reporting different vertex and edge counts for the same datasets.
Overuse of synthetic graph generators that produce graphs with unrealistic characteristics, distorting performance results.
Significant impact of dataset properties, such as vertex ordering and presence of zero-degree vertices, on benchmark performance, which is often ignored.
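The vertex-ordering issue above can be made concrete with a small sketch (our own illustration, not code from the paper): in CSR-like graph layouts, edges between vertices with nearby IDs tend to hit nearby memory, so the mean ID gap across edges is a rough proxy for cache locality. Relabeling the same graph with random IDs inflates that gap without changing the graph.

```python
import random

# Illustrative sketch (assumption, not the paper's methodology): measure the
# mean ID gap |u - v| across edges as a rough proxy for memory locality.
# Smaller gaps mean neighbors sit closer together in a CSR-like layout.

def mean_id_gap(edges):
    return sum(abs(u - v) for u, v in edges) / len(edges)

def relabel(edges, mapping):
    return [(mapping[u], mapping[v]) for u, v in edges]

# A toy graph whose edges mostly connect nearby vertex IDs.
random.seed(0)
n = 1000
edges = [(i, min(i + random.randint(1, 5), n - 1)) for i in range(n - 1)]

# Random relabeling leaves the graph isomorphic but destroys locality.
ids = list(range(n))
random.shuffle(ids)
shuffled = relabel(edges, dict(enumerate(ids)))

print(mean_id_gap(edges))     # small gap: good locality
print(mean_id_gap(shuffled))  # large gap: poor locality
```

The graph is identical in both cases; only the ID assignment differs, which is why systems can show large performance swings on "the same" dataset.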
The authors then conducted a quantitative study to demonstrate the severity of these issues. They showed that:
Vertex ordering can cause up to 38% performance differences in PageRank on popular graph processing systems.
The presence of zero-degree vertices can lead to a 10x performance boost for benchmarks like BFS and Triangle Counting.
Different graph processing systems report different numbers of triangles for the same directed graph dataset, due to a lack of standardization in triangle counting definitions.
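The triangle-count disagreement can be reproduced on a three-vertex example. The sketch below is our own illustration of two common but incompatible definitions: counting triangles on the undirected projection of a directed graph versus counting directed 3-cycles. The same edge list yields different counts under the two definitions.

```python
from itertools import combinations

# Illustrative sketch (assumption, not the paper's code): two triangle
# definitions for a directed graph that disagree on the same edge list.

def undirected_triangles(edges):
    # Project away edge direction, then count mutually connected triples.
    und = {frozenset(e) for e in edges if e[0] != e[1]}
    nodes = {v for e in edges for v in e}
    return sum(
        1 for a, b, c in combinations(sorted(nodes), 3)
        if {frozenset((a, b)), frozenset((b, c)), frozenset((a, c))} <= und
    )

def directed_cycle_triangles(edges):
    # Count directed 3-cycles a -> b -> c -> a, each cycle once.
    es = set(edges)
    nodes = {v for e in edges for v in e}
    count = 0
    for a, b, c in combinations(sorted(nodes), 3):
        for x, y, z in ((a, b, c), (a, c, b)):
            if (x, y) in es and (y, z) in es and (z, x) in es:
                count += 1
    return count

# A feed-forward triangle: one undirected triangle, zero directed cycles.
g = [(0, 1), (1, 2), (0, 2)]
print(undirected_triangles(g))      # 1
print(directed_cycle_triangles(g))  # 0
```

Unless a paper states which definition its systems use, reported triangle counts on directed datasets are not comparable.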
The authors conclude by proposing a set of best practices for benchmarking graph processing systems, including:
Developing a standardized set of benchmarks and datasets
Using the Smooth Kronecker graph generator for synthetic datasets
Reporting detailed preprocessing steps and metrics
Specifying vertex ordering and triangle counting definitions
Selecting appropriate datasets for each benchmark
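The reporting practices above can be sketched as a small audit routine (names and structure are our own, not the paper's): deriving, directly from an edge list, the statistics that papers most often disagree on, such as vertex counts with and without zero-degree vertices, and edge counts before and after dropping self-loops and duplicates.

```python
# Illustrative sketch (assumption, not the paper's tooling): compute the
# dataset statistics whose omission causes inconsistent reporting.

def dataset_report(edges, declared_num_vertices=None):
    touched = {v for e in edges for v in e}        # vertices with degree > 0
    simple = {(u, v) for u, v in edges if u != v}  # drop self-loops and dupes
    n = (declared_num_vertices if declared_num_vertices is not None
         else max(touched) + 1)
    return {
        "declared_vertices": n,
        "non_isolated_vertices": len(touched),
        "zero_degree_vertices": n - len(touched),
        "raw_edges": len(edges),
        "simple_directed_edges": len(simple),
    }

# Vertex 3 exists in the declared ID space but touches no edge; a system
# that silently drops it will report a different vertex count.
edges = [(0, 1), (1, 2), (2, 0), (2, 2), (0, 1)]
print(dataset_report(edges, declared_num_vertices=4))
```

Publishing such a report alongside results would let readers tell whether two papers using "the same" dataset actually benchmarked the same graph.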
Statistics
Changing the vertex ID assignment of the Twitter2010 dataset can cause a performance difference of up to 38% for the PageRank benchmark on several popular graph processing systems.
The presence of isolated vertices in the citPatents dataset can cause a 10x performance boost for benchmarks such as BFS and Connected Components.
Quotes
"Evaluations frequently ignore datasets' statistical idiosyncrasies, which significantly affect system performance."
"Scalability studies often use datasets that fit easily in memory on a modest desktop."
"Currently, the community has no consistent and principled manner with which to compare systems and provide guidance to developers who wish to select the system most suited to their application."