Preprint Article (this version is not peer-reviewed)

Hierarchical Expert Multi-Agent Framework for Causal Root Cause Localization in Cloud-Native Microservices

Submitted: 11 November 2025
Posted: 12 November 2025


Abstract
Cloud-native microservices are difficult to diagnose: dynamic dependencies, heterogeneous monitoring data, and diverse failure modes make root cause localization challenging. Existing methods often cannot balance accuracy with low latency, and they struggle with novel failures, large system scale, and multimodal data. This paper presents HEMA-RCL, a hierarchical expert multi-agent framework in which heterogeneous large models collaborate on diagnosis under complex dependencies. The framework organizes layered expert agents under a global orchestrator; adapts to novel failures through dynamic agent generation with efficient low-rank adaptation; reaches agreement via belief propagation with causal enhancement to reduce hub misidentification; unifies multimodal data through temporal alignment and robust feature engineering; and applies context-aware prompt optimization to reduce hallucination in large models. HEMA-RCL improves on prior methods by enabling accurate, scalable, and efficient root cause localization in cloud-native microservice systems.

1. Introduction

Cloud-native architectures split applications into microservices that interact through complex, dynamic dependencies. This design improves scalability and resilience, but it makes root cause localization harder because anomalies propagate across services and surface in heterogeneous telemetry. The task is to separate true faults from propagated effects under strict latency and accuracy constraints. Traditional methods that rely on static graphs or single-modality inputs often fail when workloads change or when data are multimodal.
Recent work with graph neural networks and causal inference improves dependency modeling, but these approaches often misidentify hub services as root causes because of structural bias. Methods based on large language models can analyze logs and metrics, but they face hallucination, context-window limits, and coordination overhead. Flat multi-agent frameworks also develop communication bottlenecks as systems scale.
HEMA-RCL is a hierarchical expert multi-agent framework that combines different large models to achieve accurate and efficient diagnosis. The layered design reduces communication cost. Dynamic agent generation adapts to new failures with small parameter overhead. Belief propagation with causal metrics helps avoid hub misidentification. Multimodal alignment combines metrics, logs, and traces in a consistent way. Context-aware prompting reduces hallucination and improves scalability for real-world use.

3. Methodology

HEMA-RCL tackles microservice root-cause localization via a hierarchical expert multi-agent system coordinating heterogeneous LLMs: a DeepSeek-V2 (236B) Orchestrator routes to Qwen-72B and Qwen-14B specialists and dynamically spawns Qwen-7B sub-agents for compute–accuracy efficiency. For reproducibility, the appendix provides pseudocode for the Orchestrator main loop and belief-propagation message passing, detailing input formats, update rules, and termination criteria. Spawning is KL-gated on detected fault-pattern distributions, and agent disagreement is resolved by factor-graph belief propagation that fuses historical reliability with transfer entropy for causal inference. On 10,000 fault-injection scenarios, HEMA-RCL attains 87.6% top-1 root-cause accuracy and reduces mean time to detection to 38.7 s.

4. Algorithm and Model

4.1. HEMA-RCL Architecture Overview

HEMA-RCL coordinates heterogeneous LLMs in a unified diagnostic stack. DeepSeek-V2 (236B) orchestrates long-context reasoning; Qwen-72B performs high-throughput metric analysis; Qwen-14B mines log patterns. The system uses seven agent types:
\[
\mathcal{S}_{\mathrm{HEMA}} = \big\{ A_{\mathrm{orch}},\; A_{\mathrm{mad}},\; A_{\mathrm{lpm}},\; A_{\mathrm{tr}},\; A_{\mathrm{ci}},\; \{A_{\mathrm{ds},i}\}_{i=1}^{N_{\mathrm{dynamic}}},\; A_{\mathrm{cb}} \big\}
\]
where $N_{\mathrm{dynamic}}$ is resource-bounded.

4.2. Hierarchical Expert System with LLM Specialization

Flat multi-agent designs incur quadratic coordination cost; with more than 5 agents, message passing consumes 67% of runtime. We adopt a three-layer hierarchy to reduce coordination while preserving information fidelity. The orchestrator uses grounded context with a compact history, compressed via PCA ($k = 32$) on recent features and centroids of similar retrieved contexts. Resource allocation uses a composite complexity metric with online weights $\alpha = 0.3$, $\beta = 0.2$, $\gamma = 0.25$, $\delta = 0.25$. The Metric Anomaly Detection (MAD) agent fuses FFT-based spectral features, seasonal LSTM states over a sliding window, and wavelet coefficients. A learnable threshold, updated toward a conservative baseline, mitigates false positives after distribution shifts.
\[
\pi_{\mathrm{orch}}(a_t \mid s_t) = \text{DeepSeek-V2}\big(\mathrm{Prompt}_{\mathrm{grounded}}(\cdot)\big)
\]
\[
C(s_t) = \alpha\, H(M_t) + \beta\, \|L_t\|_0 + \gamma\, \mathrm{depth}(T_t) + \delta\, \mathrm{VAR}_{\mathrm{temporal}}(M_t)
\]
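As an illustrative sketch only (not the paper's implementation), the composite complexity score can be computed from a metric window, log-event counts, and trace depth; the entropy and variance terms below are simple stand-ins for $H(M_t)$ and $\mathrm{VAR}_{\mathrm{temporal}}(M_t)$:

```python
import numpy as np

def composite_complexity(metrics, log_counts, trace_depth,
                         alpha=0.3, beta=0.2, gamma=0.25, delta=0.25):
    """Hypothetical sketch of C(s_t): entropy of the metric distribution,
    log-event sparsity (L0 count), call-tree depth, and temporal variance."""
    # Shannon entropy of the normalized metric magnitudes
    p = np.abs(metrics) / (np.abs(metrics).sum() + 1e-12)
    entropy = -np.sum(p * np.log2(p + 1e-12))
    l0 = np.count_nonzero(log_counts)   # number of active log event types
    temporal_var = np.var(metrics)      # variance across the window
    return alpha * entropy + beta * l0 + gamma * trace_depth + delta * temporal_var
```

A flat metric window (maximum entropy, zero variance) with one active log type and a depth-3 trace scores $0.3 \cdot 2 + 0.2 \cdot 1 + 0.25 \cdot 3 = 1.55$.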

4.3. Dynamic Agent Spawning Mechanism

Pre-configured agents degrade on novel faults (up to 40% accuracy loss). We trigger spawning via a two-stage KL-gated policy (Figure 1):
\[
p(\mathrm{spawn} \mid F_t) = \sigma\big(\mathbf{w}^{\top} [\, D_{\mathrm{KL}}(F_t \,\|\, F_{\mathrm{known}}),\; R_{\mathrm{available}},\; U_{\mathrm{current}} \,]\big)
\]
Rapid adaptation uses rank-16 LoRA on Qwen-7B, cutting trainable parameters by 98.7% while retaining 94% of performance:
\[
A_{\mathrm{ds}}^{\mathrm{new}} = \text{Qwen-7B}(\theta_{\mathrm{base}}) + \mathrm{LoRA}_{r=16}\big(\Delta\theta(F_t)\big)
\]
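A minimal sketch of the KL-gated spawn decision, assuming discrete fault-pattern distributions; the weight vector `w` is illustrative, not taken from the paper:

```python
import numpy as np

def kl_divergence(p, q, eps=1e-12):
    """D_KL(p || q) for discrete fault-pattern distributions."""
    p = np.asarray(p, dtype=float) + eps
    q = np.asarray(q, dtype=float) + eps
    p, q = p / p.sum(), q / q.sum()
    return float(np.sum(p * np.log(p / q)))

def spawn_probability(fault_dist, known_dist, resources_avail, utilization,
                      w=(2.0, 1.0, -1.0)):
    """Sketch of the KL-gated spawn policy: a sigmoid over the divergence of
    the observed fault pattern from known patterns, available resources, and
    current utilization. The weights `w` are hypothetical."""
    z = (w[0] * kl_divergence(fault_dist, known_dist)
         + w[1] * resources_avail
         + w[2] * utilization)
    return 1.0 / (1.0 + np.exp(-z))
```

A fault pattern matching the known distribution, with neutral resource terms, yields probability 0.5; a divergent pattern with free resources pushes the gate open.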

4.4. Belief Propagation-based Consensus Mechanism

Majority voting fails under correlated errors. We use factor-graph belief propagation with time-decayed reliability (Figure 2):
\[
r_i(t) = \frac{\sum_{\tau=1}^{t} \mathbb{I}[\mathrm{correct}_i(\tau)]\, e^{-\alpha (t-\tau)}}{\sum_{\tau=1}^{t} e^{-\alpha (t-\tau)}}
\]
Messages encode semantic compatibility and historical agreement:
\[
\psi_{ij}(x_i, x_j) = \exp\big(-\lambda\, d_{\mathrm{semantic}}(x_i, x_j)\big) \cdot \big(1 - |r_i - r_j|\big) \cdot (1 + \epsilon_{ij})
\]
Damping stabilizes loopy graphs:
\[
\mu_{i \to j}^{(t+1)} = (1-\rho)\, \mu_{i \to j}^{(t)} + \rho\, \hat{\mu}_{i \to j}^{(t+1)}
\]
This reduces consensus time from 8.3 to 2.1 s.
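The time-decayed reliability and the damped message update can be sketched as follows; this is a simplified illustration, not the full factor-graph implementation:

```python
import numpy as np

def reliability(correct_history, alpha=0.1):
    """Time-decayed reliability r_i(t): exponentially weighted fraction of
    past verdicts that were correct (recent verdicts weighted highest)."""
    t = len(correct_history)
    weights = np.exp(-alpha * (t - np.arange(1, t + 1)))
    return float(np.sum(weights * np.asarray(correct_history)) / np.sum(weights))

def damped_update(mu_old, mu_new, rho=0.5):
    """Damped loopy-BP message update: convex mix of the previous message
    and the freshly computed one, renormalized to a distribution."""
    mu = (1 - rho) * np.asarray(mu_old) + rho * np.asarray(mu_new)
    return mu / mu.sum()
```

An agent that was wrong early but correct recently scores higher than the reverse, which is the intended recency bias of the decay.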

4.5. Causal Inference Enhancement

Correlation-based selection overweights hub services (only 43% accuracy). We incorporate transfer entropy with an adaptive delay:
\[
\mathrm{TE}(u \to v) = \sum_{t} p(v_{t+\Delta t}, v_t, u_t)\, \log \frac{p(v_{t+\Delta t} \mid v_t, u_t)}{p(v_{t+\Delta t} \mid v_t)}
\]
\[
\Delta t_{\mathrm{optimal}} = \arg\max_{\Delta t \in [1, 10]} \big[\, \mathrm{TE}(u \to v; \Delta t) - \lambda_{\mathrm{penalty}}\, \Delta t \,\big]
\]
We integrate with GNNs via position-agnostic attention:
\[
h_v^{(l+1)} = \sigma\Big( W_{\mathrm{self}}^{(l)} h_v^{(l)} + \sum_{u \in \mathcal{N}(v)} \alpha_{vu}^{(l)} W_{\mathrm{msg}}^{(l)} h_u^{(l)} + b_{\mathrm{struct}}(G) \Big)
\]
where $b_{\mathrm{struct}}(G)$ is structure-encoded and position-agnostic.
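For discretized telemetry, transfer entropy admits a simple plug-in (histogram) estimate; this sketch is for illustration and ignores the bias corrections a production estimator would need. The `dt` parameter corresponds to the delay $\Delta t$ searched over in the adaptive-delay objective:

```python
import numpy as np
from collections import Counter

def transfer_entropy(u, v, dt=1):
    """Plug-in estimate of TE(u -> v) at delay dt for discrete series:
    sum over (v', v, u) of p(v', v, u) * log2[ p(v'|v,u) / p(v'|v) ]."""
    triples = list(zip(v[dt:], v[:-dt], u[:-dt]))  # (v_{t+dt}, v_t, u_t)
    n = len(triples)
    c3 = Counter(triples)
    c2_vu = Counter((vt, ut) for _, vt, ut in triples)
    c2_vv = Counter((vp, vt) for vp, vt, _ in triples)
    c1_v = Counter(vt for _, vt, _ in triples)
    te = 0.0
    for (vp, vt, ut), cnt in c3.items():
        p_joint = cnt / n
        p_cond_uv = cnt / c2_vu[(vt, ut)]
        p_cond_v = c2_vv[(vp, vt)] / c1_v[vt]
        te += p_joint * np.log2(p_cond_uv / p_cond_v)
    return te
```

Scanning `dt` over 1..10 and penalizing large delays recovers the $\Delta t_{\mathrm{optimal}}$ selection above.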

4.6. Data Preprocessing

Figure 3 summarizes the pipeline.

4.6.1. Multi-modal Data Alignment and Temporal Synchronization

Metrics arrive every 15 s; logs are event-driven; traces are asynchronous. Naive fusion degraded accuracy by 31%. We align to a unified grid, interpolate irregular logs, progressively encode traces, and apply robust normalization.
\[
T_{\mathrm{unified}} = \{ \tau_k : \tau_k = t_0 + k \cdot \Delta t_{\mathrm{base}},\; k \in [0, K] \}
\]
Traces use progressive completion with confidence $\alpha = \min\big(1,\, n_{\mathrm{complete}} / n_{\mathrm{expected}}\big)$:
\[
t_{\mathrm{prog}}(\tau_k) = \alpha\, t_{\mathrm{complete}}(\tau_k) + (1-\alpha)\, \hat{t}_{\mathrm{partial}}(\tau_k)
\]
Robust scaling uses MAD:
\[
\tilde{m}_{ij} = \frac{m_{ij} - \mathrm{median}(m_j)}{\mathrm{MAD}(m_j) \cdot 1.4826}
\]
This alignment and scaling reduced false positives by 43%.
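The MAD-based robust scaling is straightforward to implement; 1.4826 is the standard consistency constant that makes MAD comparable to the standard deviation under Gaussian data:

```python
import numpy as np

def mad_scale(x):
    """Robust normalization: center by the median, scale by MAD * 1.4826,
    so outliers do not inflate the scale the way they would with z-scores."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return (x - med) / (mad * 1.4826)
```

On `[1, 2, 3, 4, 100]`, the outlier barely moves the median and MAD, so the first four points keep small scores while the spike stands out.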

4.6.2. Adaptive Feature Engineering and Dimensionality Reduction

Standard PCA preserved only 67% of fault-relevant variance despite 95% total variance. We compute feature importance, aggregate multi-scale statistics (windows $s \in \{1, 5, 15, 60\}$; mean, max, gradient, FFT), and learn a supervised low-dimensional embedding with fault-aware reconstruction.
\[
I_{\mathrm{importance}}(f_i) = \omega_1\, I(f_i; Y_{\mathrm{fault}}) + \omega_2 \sum_j \big|\mathrm{TE}(f_i \to f_j)\big| + \omega_3\, \mathrm{VAR}_{\mathrm{conditional}}\big(f_i \mid Y_{\mathrm{fault}}\big)
\]
\[
\mathcal{L}_{\mathrm{reduce}} = \|X - \hat{X}\|_2^2 + \beta \sum_{i \in F_{\mathrm{fault}}} \|x_i - \hat{x}_i\|_2^2 + \gamma\, \|W_{\mathrm{encoder}}\|_1
\]
This reduced dimensionality by 94% and improved fault-detection F1 by 12%.
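The multi-scale aggregation can be sketched as below; the choice of the dominant FFT magnitude as the spectral summary is an illustrative assumption, not a detail stated in the paper:

```python
import numpy as np

def multiscale_features(series, windows=(1, 5, 15, 60)):
    """For each trailing window size, emit four statistics:
    mean, max, end-to-end gradient, and the dominant non-DC FFT magnitude."""
    feats = []
    for w in windows:
        seg = np.asarray(series[-w:], dtype=float)
        grad = (seg[-1] - seg[0]) / max(w - 1, 1)
        spectrum = np.abs(np.fft.rfft(seg))
        dominant = spectrum[1:].max() if len(spectrum) > 1 else 0.0
        feats.extend([seg.mean(), seg.max(), grad, dominant])
    return np.array(feats)
```

Each window contributes four features, so the default configuration yields a 16-dimensional summary per raw metric.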

4.7. Prompt Design

4.7.1. Context-Aware Dynamic Prompt Generation

Figure 4 summarizes evaluations.
Static prompts yield 52% accuracy. We construct prompts from state, history, and agent signals, and apply context filtering to control tokens. Prompt scaffolding adds intermediate reasoning and validation, raising accuracy by 31% and reducing hallucinations by 67%.
\[
P_{\mathrm{dynamic}} = T_{\mathrm{base}} \oplus F_{\mathrm{context}}(s_t) \oplus F_{\mathrm{history}}(H_{\mathrm{relevant}}) \oplus F_{\mathrm{guide}}(A_{\mathrm{detected}})
\]
\[
\mathrm{score}(c_i, A) = \cos\big(\mathrm{embed}(c_i), \mathrm{embed}(A)\big) \cdot \exp\big(-\lambda_t\, |t_i - t_{\mathrm{current}}|\big)
\]

4.7.2. Few-Shot Learning and Prompt Optimization

Random few-shot improves over zero-shot by 38%; optimal selection reaches 74%. We select examples via a relevance–diversity objective and optimize prompts with a genetic procedure; learned mutations penalize length and latency. This raises zero-shot accuracy by 23% and shortens prompts by 34%.
\[
E_{\mathrm{selected}} = \arg\max_{E} \Big[ \sum_{e \in E} \mathrm{rel}(e, s_t) + \lambda_{\mathrm{div}} \sum_{e_i, e_j \in E} \mathrm{dist}(e_i, e_j) \Big]
\]
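The relevance–diversity objective is submodular-like and is commonly approximated greedily; the paper does not specify the solver, so the following greedy loop is one reasonable sketch:

```python
def select_examples(candidates, rel, dist, k=4, lambda_div=0.5):
    """Greedy approximation of the relevance-diversity objective: at each
    step, add the example maximizing its relevance plus lambda_div times
    its summed distance to the examples already chosen."""
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:
        best = max(remaining,
                   key=lambda e: rel[e] + lambda_div * sum(dist[e][s] for s in selected))
        selected.append(best)
        remaining.remove(best)
    return selected
```

With a high diversity weight, a slightly less relevant but dissimilar example can beat a near-duplicate of one already selected.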

4.8. Evaluation Metrics

4.8.1. Primary Metrics

We evaluate ranking accuracy, multi-cause coverage, and timeliness with quality penalties.
\[
\mathrm{Acc}@K = \frac{1}{N} \sum_{i=1}^{N} \mathbb{I}\big[ y_i^{\mathrm{true}} \in \mathrm{TopK}(\hat{p}_i) \big]
\]
For multi-root incidents (23% of cases), we assign partial credit:
\[
\mathrm{MR\text{-}Acc}@K = \frac{1}{N} \sum_{i=1}^{N} \frac{\big|Y_i^{\mathrm{true}} \cap \mathrm{TopK}(\hat{p}_i)\big|}{\big|Y_i^{\mathrm{true}}\big|}
\]
Time efficiency is measured with a precision-aware penalty:
\[
\mathrm{QA\text{-}MTTD} = \mathrm{MTTD} \cdot \big(1 + \alpha\,(1 - \mathrm{precision})\big)
\]
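The two accuracy metrics translate directly into code; each incident contributes a hit indicator (single root) or a coverage fraction (multi-root):

```python
def acc_at_k(true_causes, ranked_preds, k):
    """Acc@K: fraction of incidents whose single true root cause appears
    in the top-K ranked predictions."""
    hits = sum(1 for y, p in zip(true_causes, ranked_preds) if y in p[:k])
    return hits / len(true_causes)

def mr_acc_at_k(true_sets, ranked_preds, k):
    """MR-Acc@K: partial credit for multi-root incidents -- the fraction
    of the true root-cause set recovered within the top-K predictions."""
    total = sum(len(set(Y) & set(p[:k])) / len(Y)
                for Y, p in zip(true_sets, ranked_preds))
    return total / len(true_sets)
```

For example, recovering one of two true roots within the top-K yields 0.5 credit for that incident rather than 0.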

4.8.2. Secondary Metrics

We report NDCG with causal-distance gains and weighted F1 for class imbalance. Resource efficiency jointly scores accuracy, compute, and latency.
\[
\mathrm{Efficiency} = \frac{\mathrm{Acc}@1}{\text{GPU-Hours}} \cdot \frac{1}{1 + \log\big(\mathrm{Latency}_{p99} / \mathrm{Latency}_{\mathrm{target}}\big)}
\]
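The efficiency score rewards accuracy per GPU-hour and discounts it as P99 latency approaches or exceeds the target (a sub-target latency makes the log term negative, boosting the score):

```python
import math

def efficiency(acc_at_1, gpu_hours, latency_p99, latency_target):
    """Composite resource-efficiency score: accuracy per GPU-hour,
    scaled by a log-latency penalty relative to the latency target."""
    return (acc_at_1 / gpu_hours) / (1.0 + math.log(latency_p99 / latency_target))
```

At exactly the target latency the penalty term is 1, so the score reduces to accuracy per GPU-hour.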

5. Experiment Results

5.1. Experimental Setup

We evaluate HEMA-RCL on 12,000 fault injection scenarios across three production deployments: a 156-service e-commerce platform, an 89-service streaming application, and a 234-service financial system. Faults cover six categories: resource exhaustion, network issues, application errors, configuration drift, cascading failures, and Byzantine faults.

5.2. Main Results and Comparisons

Table 1 presents a comprehensive performance comparison, including baseline methods and a fault-type breakdown.
HEMA-RCL achieves 18.6% improvement in Acc@1 and 50.6% reduction in MTTD compared to the best baseline (Claude-Analyst). Resource exhaustion shows highest performance (F1=0.905) due to clear metric indicators, while Byzantine faults prove most challenging (F1=0.769).

5.3. Ablation Studies and Scalability

Table 2 combines ablation results and scalability analysis.
Ablation studies reveal multi-modal fusion and hierarchical structure as most critical components (9.9% and 8.9% accuracy drops respectively). The system maintains strong performance at scale, with only 6.1% accuracy degradation from 50 to 250 services, demonstrating sub-linear computational growth.

5.4. Real-Time Performance

The system meets production SLAs with P50/P90/P99 latencies of 31.4/48.6/67.2 seconds for complete diagnosis (target: 90s) and 2.1/3.8/6.4 seconds for agent consensus (target: 10s). Memory usage peaks at 52.3GB (P99), well within the 64GB budget. These results demonstrate HEMA-RCL’s production readiness for large-scale microservice environments.

6. Conclusions

This paper presented HEMA-RCL, a hierarchical expert multi-agent system that leverages diverse large language models for microservice root cause localization. Through dynamic agent spawning, belief propagation-based consensus, and sophisticated prompt engineering, our framework achieves 87.6% top-1 accuracy while reducing mean time to detection by 50.6% compared to state-of-the-art methods. The comprehensive experimental evaluation across 12,000 fault scenarios demonstrates robust performance across diverse fault types and scales effectively to hundreds of services. The ablation studies confirm the critical contribution of each architectural component, particularly the hierarchical structure and multi-modal fusion mechanisms. HEMA-RCL establishes a new paradigm for intelligent operations in cloud-native environments, paving the way for more autonomous and reliable microservice management.

Figure 1. The pipeline of Dynamic Agent Spawning Mechanism.
Figure 2. The detail of Belief Propagation-based Consensus Mechanism.
Figure 3. The pipeline of Data Preprocessing.
Figure 4. The pipeline of Dynamic Agent Spawning Mechanism.
Table 1. Comprehensive performance comparison and fault type analysis.

Method / Fault Type      Acc@1   Acc@3   NDCG@5   MTTD (s)   F1-Score
Baseline methods
  MicroRCA               0.542   0.721   0.687    142.3      0.521
  CloudRanger            0.618   0.793   0.742     98.7      0.598
  TraceDiag              0.695   0.851   0.812     68.4      0.678
  HEMA-RCL (full)        0.876   0.952   0.928     38.7      0.859
HEMA-RCL by fault type
  Resource exhaustion    0.913   0.968   0.947     32.4      0.905
  Network issues         0.856   0.941   0.912     41.2      0.848
  Application errors     0.892   0.959   0.934     35.8      0.900
  Configuration drift    0.821   0.923   0.898     48.6      0.807
  Cascading failures     0.847   0.938   0.916     44.3      0.854
  Byzantine faults       0.783   0.901   0.876     52.1      0.769
Table 2. Ablation studies and scalability analysis.

Configuration                 Acc@1   MTTD (s)   ΔAcc@1   GPU-Hours
Component ablation
  Full model                  0.876   38.7                 2.18
  w/o dynamic spawning        0.812   45.2       -7.3%     1.92
  w/o belief propagation      0.831   42.1       -5.1%     2.05
  w/o hierarchical structure  0.798   51.3       -8.9%     2.76
  w/o multi-modal fusion      0.789   48.7       -9.9%     1.84
  Single LLM (DeepSeek-V2)    0.751   67.3       -14.3%    1.12
Scalability (service count)
  50 services                 0.912   24.3       +4.1%     0.82
  100 services                0.891   31.7       +1.7%     1.43
  150 services                0.876   38.7        0.0%     2.18
  200 services                0.863   46.2       -1.5%     3.02
  250 services                0.851   54.8       -2.9%     3.91
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.