Core Concepts
This paper introduces a dataset designed to evaluate techniques for identifying the root causes of performance issues in microservice-based applications.
Abstract
The paper introduces a dataset for evaluating root cause analysis (RCA) techniques in microservice-based applications. The dataset includes metrics such as latency, requests, and availability collected from a distributed application comprising 41 microservices. In addition to normal operation metrics, the dataset includes 68 injected performance issues, such as request overload, memory leaks, CPU hogs, and misconfigurations, which increase latency and reduce availability throughout the system. The metrics are annotated with the corresponding issues, serving as ground truth for the analysis.
The authors show how this dataset can be used to evaluate the accuracy of a variety of RCA methods spanning different causal and non-causal characterizations of the problem. They find that causal methods perform well when provided with the causal graph, but methods that rely on learning the causal graph or the full structural causal model do not perform satisfactorily when data is limited. In this case, simply ranking potential root causes by their correlation with the target and filtering for anomalous variables provides a strong baseline.
The authors hope that this dataset will enable further development of robust RCA techniques that can work well with limited data and do not require access to the causal graph, which is often not available in practice.
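The correlation-based baseline mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the z-score anomaly filter, its threshold of 3.0, and the toy metric values are all assumptions made for illustration. The idea is to keep only metrics that look anomalous during the incident window, then rank them by the absolute value of their Pearson correlation with the target metric.

```python
from math import sqrt
from statistics import mean, pstdev


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        return 0.0
    return cov / sqrt(var_x * var_y)


def is_anomalous(normal, abnormal, z_threshold=3.0):
    """Assumed filter: flag a metric whose abnormal-period samples deviate
    from the normal-period mean by more than z_threshold standard deviations."""
    mu, sigma = mean(normal), pstdev(normal)
    if sigma == 0:
        return any(v != mu for v in abnormal)
    return max(abs(v - mu) / sigma for v in abnormal) > z_threshold


def rank_root_causes(normal, abnormal, target, z_threshold=3.0):
    """Return anomalous candidate metrics sorted by |correlation| with the target.

    normal / abnormal: dicts mapping metric name -> list of samples.
    """
    target_series = normal[target] + abnormal[target]
    scores = {}
    for name in normal:
        if name == target:
            continue
        if not is_anomalous(normal[name], abnormal[name], z_threshold):
            continue  # drop metrics that behave normally during the incident
        series = normal[name] + abnormal[name]
        scores[name] = abs(pearson(series, target_series))
    return sorted(scores, key=scores.get, reverse=True)


# Toy data, made up for illustration (not taken from the dataset).
normal = {
    "frontend_latency": [100, 102, 98, 101, 99, 100],
    "payment_latency": [20, 21, 19, 20, 21, 19],
    "search_latency": [30, 31, 29, 30, 31, 29],
    "cart_latency": [15, 15, 16, 14, 15, 15],
}
abnormal = {
    "frontend_latency": [180, 210, 240],
    "payment_latency": [60, 75, 90],
    "search_latency": [30, 29, 31],
    "cart_latency": [22, 15, 16],
}

ranking = rank_root_causes(normal, abnormal, target="frontend_latency")
print(ranking)  # ['payment_latency', 'cart_latency']
```

In this toy example, search_latency is filtered out because it stays within its normal range, and payment_latency ranks first because it tracks the target most closely. With ground-truth annotations such as those in the dataset, a ranking like this could be scored by checking whether the annotated root cause appears among the top-ranked candidates.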
Stats
The dataset contains metrics such as latency, requests, and availability, collected at 5-minute intervals from a distributed application comprising 41 microservices.