The paper presents MARIO, a system for efficiently processing and analyzing Azure Core workload insights data to detect highly significant faults or anomalies. The key highlights are:
Azure Core workload insights data contains time-series data with various metrics, resources, and dimensions. Faults or anomalies need to be detected for each metric-resource-dimension combination.
The goal is to identify a limited set of highly significant anomalies (5-20 per hour) that are easily perceivable by users, have high reconstruction error in time-series forecasting models, and can help in proactive issue detection and root cause analysis.
The proposed solution has two stages:
The system is deployed as part of the MARIO service and has been tested on Azure Core workload insights data as well as on benchmark datasets such as Electricity and Volatility. The paper reports that it outperforms state-of-the-art methods such as Temporal Fusion Transformers (TFT) and DeepAR in identifying highly significant anomalies.
The solution is designed to provide high transparency through confidence scores, enable proactive issue detection, and reduce cognitive load on users by surfacing only the most critical anomalies.
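The idea of surfacing only a small, ranked set of anomalies per hour can be illustrated with a minimal sketch. This is not the paper's method: it assumes that forecasts from some time-series model are already available, uses the standardized absolute forecast error as a stand-in for reconstruction error, and caps the output per hour. The names Point, top_anomalies, max_per_hour, and min_score are hypothetical and chosen only for illustration.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass(frozen=True)
class Point:
    """One observation of a metric-resource-dimension series (hypothetical schema)."""
    hour: int          # hour bucket, assumed to be an integer index
    metric: str
    resource: str
    dimension: str
    actual: float      # observed value
    forecast: float    # value predicted by some forecasting model (assumed given)


def top_anomalies(points, max_per_hour=20, min_score=2.0):
    """Return {hour: [(score, point), ...]} with at most max_per_hour entries per hour.

    The score is the absolute forecast error standardized within each
    metric-resource-dimension series, so series with different units
    can be ranked against each other.
    """
    # Collect absolute errors per series to estimate each series' error scale.
    errors = defaultdict(list)
    for p in points:
        key = (p.metric, p.resource, p.dimension)
        errors[key].append(abs(p.actual - p.forecast))

    by_hour = defaultdict(list)
    for p in points:
        errs = errors[(p.metric, p.resource, p.dimension)]
        mu, sigma = mean(errs), pstdev(errs)
        # A series with constant error has no outliers under this score.
        score = (abs(p.actual - p.forecast) - mu) / sigma if sigma > 0 else 0.0
        if score >= min_score:
            by_hour[p.hour].append((score, p))

    # Keep only the highest-scoring anomalies in each hour.
    return {
        hour: sorted(items, key=lambda t: t[0], reverse=True)[:max_per_hour]
        for hour, items in by_hour.items()
    }
```

With max_per_hour set somewhere in the 5-20 range mentioned above, the function returns a short ranked list per hour, and the score can be shown next to each entry as a rough indication of how far the point deviates from its forecast.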
Key insights distilled from: Pranay Lohia... at arxiv.org, 04-16-2024
https://arxiv.org/pdf/2404.09302.pdf