Automated Detection of High-Impact Faults in Azure Core Workload Insights

Core Concepts
MARIO is a system that automatically detects highly significant faults or anomalies in Azure Core workload insights data using a combination of time-series forecasting models and Extreme Value Theory.
The paper presents MARIO, a system that addresses the challenge of efficiently processing and analyzing Azure Core workload insights data to detect highly significant faults or anomalies. The key highlights are:
Azure Core workload insights data is time-series data spanning various metrics, resources, and dimensions. Faults or anomalies need to be detected for each metric-resource-dimension combination.
The goal is to identify a limited set of highly significant anomalies (5-20 per hour) that are easily perceivable by users, have high reconstruction error in time-series forecasting models, and help with proactive issue detection and root cause analysis.
The proposed solution has two stages:
Stage 1: Use an enhanced version of Microsoft's Anomaly Detection as a Service (ADaaS) to yield fewer anomalies (fewer than 150 per hour) without reducing true positives.
Stage 2: Apply Extreme Value Theory (EVT) on top of the enhanced ADaaS to filter out low-significance anomalies and keep only the highly significant ones (around 5-20 per hour).
The system is deployed as part of the MARIO service and has been tested on Azure Core workload insights data as well as benchmark datasets such as Electricity and Volatility. It outperforms state-of-the-art methods such as Temporal Fusion Transformers (TFT) and DeepAR in identifying highly significant anomalies.
The solution is designed to provide high transparency through confidence scores, enable proactive issue detection, and reduce cognitive load on users by surfacing only the most critical anomalies.
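The Stage 2 idea, using EVT to keep only the most extreme Stage 1 anomalies, can be illustrated with a small sketch. The summary does not say which EVT formulation MARIO uses, so the code below assumes a peaks-over-threshold approach: fit a Generalized Pareto Distribution (GPD) to the reconstruction errors that exceed a high initial quantile, then derive a final threshold for a target tail probability. The function names, the `init_quantile` and `risk` parameters, the method-of-moments fit (chosen for brevity; maximum likelihood is the more common choice), and the synthetic exponential errors are all assumptions made for illustration.

```python
import math
import random
from statistics import mean, pvariance


def fit_gpd_moments(excesses):
    """Method-of-moments GPD fit; returns (shape xi, scale sigma)."""
    m = mean(excesses)
    v = pvariance(excesses)
    if v == 0:
        raise ValueError("excesses have zero variance; cannot fit a GPD")
    ratio = m * m / v
    xi = 0.5 * (1.0 - ratio)
    sigma = 0.5 * m * (ratio + 1.0)
    return xi, sigma


def pot_threshold(errors, init_quantile=0.98, risk=1e-4):
    """Return z such that P(error > z) is roughly `risk` under the fitted tail."""
    ordered = sorted(errors)
    n = len(ordered)
    t = ordered[int(init_quantile * n)]           # initial high threshold
    excesses = [e - t for e in ordered if e > t]  # peaks over that threshold
    if len(excesses) < 2:
        raise ValueError("not enough exceedances to fit the tail")
    xi, sigma = fit_gpd_moments(excesses)
    r = risk * n / len(excesses)
    if abs(xi) < 1e-9:                            # exponential-tail limit
        return t - sigma * math.log(r)
    return t + (sigma / xi) * (r ** (-xi) - 1.0)


# Hypothetical stand-in for Stage 1 reconstruction errors.
rng = random.Random(0)
errors = [rng.expovariate(1.0) for _ in range(5000)]

z = pot_threshold(errors)
flagged = [e for e in errors if e > z]  # candidates for "highly significant"
print(f"threshold={z:.2f}, flagged={len(flagged)} of {len(errors)}")
```

Lowering `risk` tightens the filter, which is one way such a stage could be tuned toward a budget like 5-20 reported anomalies per hour.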
The Azure Core workload insights data contains 152 unique resource types, 62 unique metrics, and 33,647 unique dimension values, resulting in 139,483,854 records. The Electricity dataset has 26 anomalies out of 9,000 data points (0.29%), and the Volatility dataset has 21 anomalies out of 510 data points (4.12%).
"The number of anomalies reported should be highly significant and in a limited number, e.g., 5-20 anomalies reported per hour." "The reported anomalies will have significant user perception and high reconstruction error in any time-series forecasting model."

Key Insights Distilled From

by Pranay Lohia... at 04-16-2024
High Significant Fault Detection in Azure Core Workload Insights

Deeper Inquiries

What other techniques or models could be explored to further improve the identification of highly significant anomalies in Azure Core workload insights data?

To further enhance the identification of high-significant anomalies in Azure Core workload insights data, exploring ensemble methods could be beneficial. Ensemble methods involve combining multiple models to improve prediction accuracy and robustness. By leveraging techniques like Random Forest, Gradient Boosting, or Stacking, the solution can benefit from the diversity of models and their collective ability to capture different aspects of the data. Ensemble methods can help in reducing overfitting, enhancing generalization, and improving the overall anomaly detection performance.
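As a rough illustration of the ensemble idea, the sketch below combines three simple unsupervised detectors by averaging their rank-normalized scores. This is a deliberately simpler stand-in for the Random Forest, Gradient Boosting, or Stacking approaches named above, not an implementation of them, and none of it comes from the paper: the detector functions, the averaging rule, and the sample series are all hypothetical.

```python
from statistics import mean, median, pstdev


def zscore_scores(xs):
    """Distance from the mean in units of standard deviation."""
    mu, sd = mean(xs), pstdev(xs)
    return [abs(x - mu) / sd if sd else 0.0 for x in xs]


def mad_scores(xs):
    """Distance from the median in units of median absolute deviation."""
    med = median(xs)
    mad = median([abs(x - med) for x in xs])
    return [abs(x - med) / mad if mad else 0.0 for x in xs]


def diff_scores(xs):
    """Size of the jump from the previous point (0 for the first point)."""
    return [0.0] + [abs(b - a) for a, b in zip(xs, xs[1:])]


def rank_normalize(scores):
    """Map scores to [0, 1] by rank so detectors on different scales compare."""
    order = sorted(range(len(scores)), key=scores.__getitem__)
    ranks = [0.0] * len(scores)
    for rank, idx in enumerate(order):
        ranks[idx] = rank / (len(scores) - 1)
    return ranks


def ensemble_scores(xs, detectors):
    """Average the rank-normalized scores of every detector, point by point."""
    normalized = [rank_normalize(detect(xs)) for detect in detectors]
    return [mean(column) for column in zip(*normalized)]


# Hypothetical metric values with one obvious spike at index 6.
series = [10, 11, 10, 12, 11, 10, 50, 11, 10, 12]
scores = ensemble_scores(series, [zscore_scores, mad_scores, diff_scores])
top = max(range(len(series)), key=scores.__getitem__)
print(top, scores[top])
```

Rank normalization is used here so that no single detector dominates merely because its raw scores are on a larger scale. A point scores highly only when several detectors agree on it.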

How can the proposed solution be extended to handle multivariate time-series data and capture the relationships between different metrics and resources?

Extending the proposed solution to handle multivariate time-series data and capture relationships between different metrics and resources can be achieved through the implementation of techniques like Vector Autoregression (VAR) models, Long Short-Term Memory (LSTM) networks, or Graph Neural Networks (GNNs). VAR models can capture dependencies between multiple time series variables, LSTM networks can handle sequential data effectively, and GNNs can model complex relationships in a graph structure. By incorporating these techniques, the solution can analyze interdependencies between metrics, resources, and dimensions, providing a more comprehensive understanding of anomalies in a multivariate context.
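To make the VAR suggestion concrete, here is a minimal sketch of a first-order VAR fitted to two metrics by least squares, with the one-step prediction residual used as a joint anomaly score. It assumes mean-centered series and no intercept, and it is restricted to two variables so the 2x2 inverse can be written by hand. The function names and the synthetic oscillating data are hypothetical, and a real deployment would more likely rely on a statistics library.

```python
import math


def fit_var1(series):
    """Least-squares fit of x_t = A x_{t-1} for a list of (a, b) pairs."""
    s01 = [[0.0, 0.0], [0.0, 0.0]]  # sum of x_t x_{t-1}^T
    s00 = [[0.0, 0.0], [0.0, 0.0]]  # sum of x_{t-1} x_{t-1}^T
    for prev, cur in zip(series, series[1:]):
        for i in range(2):
            for j in range(2):
                s01[i][j] += cur[i] * prev[j]
                s00[i][j] += prev[i] * prev[j]
    det = s00[0][0] * s00[1][1] - s00[0][1] * s00[1][0]
    if abs(det) < 1e-12:
        raise ValueError("lagged values are collinear; cannot fit VAR(1)")
    inv = [[s00[1][1] / det, -s00[0][1] / det],
           [-s00[1][0] / det, s00[0][0] / det]]
    # A = S01 @ inverse(S00)
    return [[sum(s01[i][k] * inv[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def residual_norm(a, prev, cur):
    """Euclidean distance between `cur` and the one-step prediction A @ prev."""
    pred = [a[i][0] * prev[0] + a[i][1] * prev[1] for i in range(2)]
    return math.hypot(cur[0] - pred[0], cur[1] - pred[1])


# Hypothetical pair of coupled, oscillating metrics (a pure rotation).
theta = 0.3
history = [(math.cos(k * theta), math.sin(k * theta)) for k in range(50)]

a_hat = fit_var1(history)
normal = residual_norm(a_hat, history[-2], history[-1])
spiked = residual_norm(a_hat, history[-2],
                       (history[-1][0] + 2.0, history[-1][1]))
print(f"normal residual={normal:.6f}, spiked residual={spiked:.3f}")
```

Because the model predicts each metric from the previous values of both metrics, a point can be flagged when one metric moves in a way the other does not explain, even if each value looks ordinary on its own.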

What are the potential applications of the highly significant anomaly detection approach beyond Azure Core workload insights, and how can it be generalized to other domains?

The anomaly detection approach developed for Azure Core workload insights data is not specific to Azure. It can be generalized to finance for fraud detection, healthcare for patient monitoring, manufacturing for predictive maintenance, and cybersecurity for threat detection. By adapting the solution to a new domain and dataset, organizations can proactively identify critical anomalies, prevent system failures, optimize operations, and improve decision-making.