Preprint
Article

This version is not peer-reviewed.

Integrating Large Language Models with Cloud-Native Observability for Automated Root Cause Analysis and Remediation

Submitted:

20 December 2025

Posted:

22 December 2025

You are already at the latest version

Abstract
Cloud-native systems based on microservices, containers, and server- less architectures present unprecedented challenges for observabil- ity and incident management. Traditional rule-based monitoring and manual root cause analysis are increasingly inadequate for han- dling the complexity and scale of modern distributed systems. This paper presents a novel framework that leverages large language models (LLMs) to enhance cloud-native observability, enabling automated root cause analysis and self-healing capabilities. Our system integrates OpenTelemetry-based telemetry collection with a domain-adapted LLM capable of performing multimodal analysis over metrics, logs, and traces. Through fine-tuning on operational data and chain-of-thought reasoning, the LLM generates explain- able root cause hypotheses and actionable remediation plans. Exper- imental evaluation on public microservice datasets demonstrates that our approach reduces mean time to resolution (MTTR) by 84.2% compared to rule-based methods, achieving 95% F1-score in anomaly detection while maintaining low computational overhead. The system successfully automated 91% of common incidents with- out human intervention, significantly improving service reliability and reducing operational burden.
Keywords: 
;  ;  ;  ;  ;  ;  ;  
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permit the free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Prerpints.org logo

Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

Subscribe

Disclaimer

Terms of Use

Privacy Policy

Privacy Settings

© 2025 MDPI (Basel, Switzerland) unless otherwise stated