Preprint (Communication), Version 1. Preserved in Portico. This version is not peer-reviewed.

A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification

Version 1 : Received: 7 December 2023 / Approved: 8 December 2023 / Online: 8 December 2023 (04:34:50 CET)

A peer-reviewed article of this Preprint also exists.

Poghosyan, A.; Harutyunyan, A.; Davtyan, E.; Petrosyan, K.; Baloian, N. A Study on Automated Problem Troubleshooting in Cloud Environments with Rule Induction and Verification. Appl. Sci. 2024, 14, 1047.

Abstract

Remediation of IT issues encoded in domain-specific or user-defined alerts in cloud environments and customer ecosystems suffers, in the vast majority of cases, from a lack of accurate recommendations that could be supplied in time to recover from performance degradations. Such recommendations are hard to provide by furnishing the abnormality definitions with appropriate expert knowledge, since that knowledge varies from one environment to another. At the same time, in a large proportion of support cases, the problems handled by Global Support Services (GSS) or Site Reliability Engineering (SRE) are ultimately escalated to the product teams, who then spend costly development hours investigating the self-monitoring metrics of our solutions. As a result, the mean-time-to-resolution (MTTR) of problems and alerts is significantly affected by the lack of a systematic approach to adopting AI Ops. Such an approach would imply building, maintaining, and continuously improving and annotating a data store of insights on which ML models are trained and generalized across the whole customer base and corporate cloud services. Our ongoing study is in line with this vision and validates an approach that learns alert resolution patterns in such a global setting and explains them using interpretable AI methodologies. The resulting knowledge store of causative rules is then applied to predict potential sources of the application degradation reflected in an active alert instance. In this communication, we share our experience with a prototype solution and an up-to-date analysis demonstrating how root conditions are discovered with high accuracy for a specific type of problem. The approach is validated against historical data of resolutions obtained through heavy manual development effort. We also offer a Dempster-Shafer theory-based rule verification framework that experts can use as a what-if analysis tool to test their hypotheses about the underlying environment.
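
To give a feel for the Dempster-Shafer machinery mentioned in the abstract, the following is a minimal Python sketch of Dempster's rule of combination together with the belief and plausibility measures. It is a generic textbook illustration, not the authors' verification framework: the function names, the two "rule" evidence sources, and the candidate root causes `cpu` and `disk` are all hypothetical and chosen only for the example.

```python
from itertools import product


def combine(m1, m2):
    """Dempster's rule of combination for two mass functions.

    Each mass function maps a frozenset of hypotheses (a focal element)
    to its mass; the masses of one function are assumed to sum to 1.
    """
    combined = {}
    conflict = 0.0
    for (a, wa), (b, wb) in product(m1.items(), m2.items()):
        inter = a & b
        if inter:
            combined[inter] = combined.get(inter, 0.0) + wa * wb
        else:
            # Mass assigned to contradictory (disjoint) hypotheses.
            conflict += wa * wb
    if conflict >= 1.0:
        raise ValueError("sources are in total conflict; cannot combine")
    norm = 1.0 - conflict
    return {h: w / norm for h, w in combined.items()}


def belief(m, hypothesis):
    """Total mass of focal elements fully contained in the hypothesis."""
    return sum(w for h, w in m.items() if h <= hypothesis)


def plausibility(m, hypothesis):
    """Total mass of focal elements that intersect the hypothesis."""
    return sum(w for h, w in m.items() if h & hypothesis)


# Hypothetical frame of discernment: two candidate root causes.
FRAME = frozenset({"cpu", "disk"})
CPU = frozenset({"cpu"})
DISK = frozenset({"disk"})

# Two hypothetical rules; mass on FRAME expresses "don't know".
rule_a = {CPU: 0.6, FRAME: 0.4}
rule_b = {DISK: 0.5, FRAME: 0.5}

fused = combine(rule_a, rule_b)
print(fused)
print(belief(fused, CPU), plausibility(fused, CPU))
```

The interval between belief and plausibility for a hypothesis indicates how much of the combined evidence is still uncommitted, which is the kind of quantity an expert could inspect in a what-if analysis.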

Keywords

automated troubleshooting; real-time product activity detection; problem root cause analysis; machine learning; explainable AI; proactive SaaS support

Subject

Computer Science and Mathematics, Artificial Intelligence and Machine Learning
