Preprint
Article

This version is not peer-reviewed.

A Multi-Agent ChatOps Architecture for Reliable AI-Assisted Large-Scale Network Operations

Submitted: 10 June 2026

Posted: 11 June 2026

Abstract
Large-scale network operations require engineers to work across heterogeneous tools and dashboards, which can lengthen Mean Time to Repair (MTTR) and affect availability when it delays incident resolution. We present an agentic ChatOps system in which a supervisor orchestrates large language model (LLM) agents that route natural-language intents to specialized workers issuing planned tool calls across operations systems, with retrieval-augmented grounding in a network source of truth. We deploy the approach in a major global network—a deployment that has since grown to fourteen specialized workers plus dedicated expert and autonomous agents—and detail seven representative use cases, including closed-loop link-flapping remediation that validates candidate changes in a digital twin before committing version-controlled configuration. Across six recurring tasks, the integrated ChatOps automation system was associated with per-event handling times lower by roughly 10× to 2400×, and the link-flap cycle was shortened from about 30 to 3 minutes. Over a representative 90-day window it handled roughly 7,400 production interactions, with positive feedback on most rated responses; an offline benchmark of 200 questions scored by an LLM-as-a-judge yielded mean relevance and context relevance of 0.85 and 0.79. The results support modeled availability improvement when saved handling time reduces incident MTTR, in a deployment designed for confidentiality and guarded by validation pre-checks.

1. Introduction

Operating a large-scale network has traditionally demanded a substantial amount of manual intervention. Day-to-day work in a global Network Operations Center (NOC) divides broadly into two classes of activity: provisioning and troubleshooting. Provisioning tasks are high-frequency and are paced by customer demand—in carrier-grade and interconnection environments the principal customers are cloud service providers, tier-2 Internet service providers (ISPs), and other top-tier telecommunication operators, whose own growth directly drives the volume of changes requested. The number of provisioning tasks fluctuates strongly with time of day, time zone, and external events, which makes capacity planning for the operations workforce difficult. Troubleshooting tasks, in turn, are driven both by the volume of provisioning activity and by exogenous factors such as third-party carrier faults, subsea cable cuts, scheduled maintenance, aging or faulty equipment, and human error during corrective actions. Every manual task, even a repetitive one, is a potential source of human error and of avoidable delay.
A defining characteristic of these environments is tool sprawl. The systems that support monitoring, tracking, locating, storing, and interacting with network state are sourced from many different software suppliers: some expose modern programmable interfaces (APIs), while others remain legacy tools accessible only through bespoke consoles or dashboards. Maintaining a true “single pane of glass” across this heterogeneous estate is difficult, and requiring engineers to context-switch between disparate tools for each task lengthens task completion time and, critically, the time required to restore impacted services. Because service availability is inversely related to Mean Time to Repair (MTTR), time an engineer spends navigating tools rather than resolving the underlying fault can translate into reduced availability, and possible Service Level Agreement (SLA) violations, when it lies on the incident critical path.
Recent progress in generative artificial intelligence (AI), and in particular the rise of agentic systems built on large language models (LLMs), creates an opportunity to reframe these interactions. An LLM agent can translate a natural-language statement of intent into a sequence of tool invocations, reason over the heterogeneous data returned, and present a consolidated, human-readable answer. This shifts the operator’s workflow from procedural (remember which dashboard holds which datum, log in, query, copy, correlate) to intent-driven (state the goal; let the agent plan and execute). A mature open-source and commercial ecosystem now supports such systems: frameworks such as LangChain and LangGraph integrate LLMs, embedding models, vector databases, and Retrieval-Augmented Generation (RAG) into a single programmable substrate, and LangGraph in particular provides primitives for constructing stateful, multi-step agentic workflows.
Translating this opportunity into a production-grade capability for a global network operator is, however, far from a “plug-and-play” exercise. Three constraints dominate. First, data sensitivity: agents interact with confidential customer and infrastructure data during both provisioning and troubleshooting, so the underlying models must, where required, run on-premises without data leaving the operator’s trust boundary. The proliferation of accelerated hardware (e.g., NVIDIA GPU clusters) and serving software (e.g., Ollama, vLLM) now makes local execution of capable open-weight models practical. Second, tool heterogeneity and scale: the agent must interface with dozens of APIs of varying quality, and the design must remain robust as tools are added or replaced. Third, trust and observability: autonomous actions that touch a live production network require guardrails, evaluation, and auditability so that the system adheres to responsible-AI principles—explainability, robustness, transparency, and privacy. We address the third constraint with an open-source tracing and observability layer (Langfuse) combined with LLM-as-a-judge scoring—on dimensions such as helpfulness, correctness, hallucination, and relevance—and explicit user feedback, providing per-interaction quality and audit signals.
This article presents an end-to-end design in which AI agents, fronted by a ChatOps interface, execute network-operations tasks that span many tools and dashboards on the basis of natural-language commands. We deploy and evaluate the approach inside a major global interconnection and network environment—a multi-region, carrier-scale interconnection network—across an operational capability set that has grown to fourteen specialized workers plus dedicated expert and autonomous agents (Table 3); we examine seven representative use cases in detail—spare locating, console-information retrieval, packet-loss and latency analysis, control-plane path retrieval, node-isolation detection, vendor knowledge-base search, and automated remediation of backbone link-flapping events—plus a composite ChatOps aggregator workflow that chains these capabilities behind a single escalation thread. Our central finding is that intent-driven, agentic execution is associated with per-event wall-clock times one to three orders of magnitude lower for these tasks, which can in turn lower MTTR when they lie on the incident critical path and, through the availability model, provides a modeled pathway to improved end-to-end service availability.
Contributions. The specific contributions of this article are:
(i) A reference architecture that couples a ChatOps front end with four components: a supervisor-orchestrated multi-agent system, in which specialized domain agents run under an LLM supervisor that routes by reasoning and confidence and dispatches workers in parallel; an intelligent query router that arbitrates, by data-sensitivity class, between a locally hosted open-weight model and managed cloud models; a RAG subsystem over a network “source of truth”; and a heterogeneous tool layer spanning inventory, monitoring, control-plane, and configuration-management systems.
(ii) The design of seven production operational use cases, each formulated as an agentic workflow combining planning, function calling, and retrieval, including a backbone link-flapping remediation workflow that fuses data-plane (SNMP-trap) and control-plane (BGP Monitoring Protocol) signals with digital-twin simulation prior to any configuration change.
(iii) An evaluation methodology grounded in measured production data (per-event handling times, task frequencies, volumes, user feedback, and token-usage signals from Langfuse traces), together with an analytical model that links the potential MTTR reduction to per-service availability on a representative inter-continental topology.
(iv) A quantitative, case-study demonstration, on real operational data from a major global network, of per-event handling-time reductions by factors ranging from roughly 10× to 2400×, with a discussion of the security, safety, generalizability, and token-usage considerations that govern production adoption.
This article is an extended version of our conference paper presented at the 2025 IEEE Conference on Artificial Intelligence (CAI) [1]. Relative to the conference version, it adds a formal problem formulation and a structured methodology (Section 3), a substantially expanded review of agentic AI, RAG, and LLMs-for-networking with a comparative positioning table (Section 2), a dedicated evaluation methodology including the Langfuse-based quality assessment and a consolidated quantitative analysis (Section 4), and an entirely new discussion of security, reliability, generalizability, token usage, and threats to validity (Section 5).
The remainder of the article is organized as follows. Section 2 reviews related work on agentic AI design patterns, retrieval-augmented generation, and LLMs for networking and IT operations, and positions our contribution. Section 3 formalizes the problem and details the system architecture, agent design, model selection, tool integration, the operational use cases, and the evaluation methodology. Section 4 reports the empirical results. Section 5 discusses the implications, limitations, and threats to validity, and Section 6 concludes and outlines future work.

3. Materials and Methods

3.1. Problem Formulation

We model the operator’s day as a stream of tasks $t \in T$, where $T$ is partitioned into provisioning and troubleshooting classes. Each task type $k$ occurs with empirical frequency $f_k$ (events per week) and incurs a handling time $\tau_k$ (seconds per event), measured as the wall-clock time from the moment an engineer begins the task to the moment the required information is obtained or the corrective action is applied. The total weekly operational effort is $E = \sum_k f_k \tau_k$. In this study we report the per-event handling times $\tau_k$ and the production frequencies $f_k$ separately, and we do not estimate the aggregate weekly effort $E$, because the two quantities were measured over different periods; $E$ therefore serves only as a conceptual model that motivates the objective below.
For troubleshooting tasks that lie on the critical path of an active incident, $\tau_k$ contributes directly to the Mean Time to Repair (MTTR). Service availability $A$ is related to MTTR and to the Mean Time Between Failures (MTBF) by
$$A = \frac{\mathrm{MTBF}}{\mathrm{MTBF} + \mathrm{MTTR}},$$
so that, for a fixed MTBF (a property of hardware and design that operations cannot readily change), reducing MTTR strictly increases availability. The objective of an agentic operations system is therefore to minimize $\tau_k$ (and hence $E$ and MTTR) by replacing manual, multi-tool procedures with intent-driven automation, without sacrificing correctness, safety, or data confidentiality. Formally, the agent implements a mapping $g : q \mapsto a$ from a natural-language intent $q$ to an action or answer $a$, realized as a planned sequence of tool calls $o_1, \ldots, o_m$ over the available tool set $O$, subject to guardrail constraints $C$ (e.g., “no configuration change without a passing digital-twin pre-check”).
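As a worked illustration of the availability relation above, the following minimal sketch evaluates $A$ for a fixed MTBF and two MTTR values. The MTBF and MTTR numbers are hypothetical and chosen only to show the direction of the effect; they are not measurements from the deployment.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability A = MTBF / (MTBF + MTTR)."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


def annual_downtime_minutes(a: float) -> float:
    """Expected downtime per 365-day year implied by availability a."""
    return (1.0 - a) * 365 * 24 * 60


# Hypothetical values for illustration only (not from the deployment).
MTBF_H = 2000.0
before = availability(MTBF_H, mttr_hours=2.0)
after = availability(MTBF_H, mttr_hours=1.0)

print(f"A before: {before:.6f}, downtime/yr: {annual_downtime_minutes(before):.0f} min")
print(f"A after:  {after:.6f}, downtime/yr: {annual_downtime_minutes(after):.0f} min")
```

With MTBF held fixed, halving MTTR roughly halves the implied annual downtime whenever MTTR is small relative to MTBF.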

3.2. System Architecture

The entry point is a chat interface on an enterprise collaboration platform (e.g., Slack, Microsoft Teams, or Jabber), which already serves as the place where all network teams open, track, and escalate tickets. A WebSocket connection is established between this front end and the agent (the “Bot”)—a supervisor-orchestrated multi-agent system (Section 3.3)—which listens for mentions, extracts intent, and routes work to specialized agents. The Bot runs on an operator-controlled server and brokers every downstream interaction.
A central component is the Intelligent Query Router, which decides—per request and per data-sensitivity class—which model deployment serves the request: a locally hosted open-weight model or a managed cloud model (the deployment has evolved on this axis; see Section 3.4). Sensitive credentials are held in a secrets-encryption platform rather than embedded in prompts or code, and a vector database stores embeddings used by the retrieval-augmented subsystem (a hybrid dense/sparse pipeline with neural reranking; Section 3.6). On the right side of the architecture, the Bot integrates with the operational tool estate over HTTPS/API keys and gRPC: case and incident management, monitoring, AIOps, network management systems (NMS), inventory management, DDoS protection, log aggregation, and troubleshooting/visualization toolkits.
Figure 1 shows the overall reference architecture.

3.3. Multi-Agent Architecture: Supervisor and Specialized Workers

Rather than a single monolithic agent, the system is organized as a supervisor-orchestrated multi-agent architecture, built on a stateful agent-graph framework with checkpointed state. Each intent first passes through a context-summarization entry step (token budgeting over the conversation) and then to a supervisor that, using an LLM constrained to structured output (with sticky routing and fallback chains), selects one or more specialized agents together with an explicit routing rationale and a confidence score; when an intent spans several domains the supervisor dispatches multiple workers in parallel, after which a result-synthesis step merges their outputs into a single grounded response posted to the chat thread. This division of labor follows the multi-agent design pattern (Section 2.1). As the deployment matured, the routed worker set grew to fourteen specialized domain workers—each a domain expert (for example, inventory and infrastructure, network infrastructure and topology, IP/DNS, change requests, service and incident management, vendor support, and knowledge-base retrieval) equipped with its own task-appropriate tool set and tool-execution loop. Three classes of dedicated agents complement the routed workers: pattern-matched expert agents (change validation and configuration audit); autonomous fast-path agents (self-healing and network troubleshooting) that detect, act, and return directly without the synthesis step; and an incident-management policy gate invoked by the autonomous agents. Table 3 summarizes the full capability set, and different agents may use different underlying models (Section 3.4).
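One way to picture the structured routing output described above is the following minimal sketch, which validates a supervisor decision and falls back when the output is unusable. The JSON field names, the worker identifiers, the confidence threshold, and the fallback choice are all hypothetical; the source does not specify the schema or the fallback policy.

```python
import json
from dataclasses import dataclass

# Illustrative subset of worker names; the actual capability set is in Table 3.
KNOWN_WORKERS = {"inventory", "network_infrastructure", "ip_dns",
                 "change_requests", "incident_management",
                 "vendor_support", "knowledge_base"}
FALLBACK_WORKER = "knowledge_base"   # hypothetical fallback choice
MIN_CONFIDENCE = 0.5                 # hypothetical threshold


@dataclass(frozen=True)
class RoutingDecision:
    workers: tuple[str, ...]
    rationale: str
    confidence: float


def parse_routing(raw: str) -> RoutingDecision:
    """Validate the supervisor's JSON output; fall back when it is unusable."""
    try:
        data = json.loads(raw)
        # Silently drop worker names that are not in the known set.
        workers = tuple(w for w in data["workers"] if w in KNOWN_WORKERS)
        confidence = float(data["confidence"])
        rationale = str(data["rationale"])
    except (ValueError, KeyError, TypeError):
        return RoutingDecision((FALLBACK_WORKER,), "unparseable output", 0.0)
    usable = bool(workers) and MIN_CONFIDENCE <= confidence <= 1.0
    if not usable:
        return RoutingDecision((FALLBACK_WORKER,),
                               "low confidence or unknown workers", confidence)
    return RoutingDecision(workers, rationale, confidence)


decision = parse_routing(
    '{"workers": ["inventory", "ip_dns"], '
    '"rationale": "spare lookup plus address check", "confidence": 0.9}'
)
print(decision.workers, decision.confidence)
```

Returning several workers in one decision corresponds to the parallel-dispatch case; the merge of their outputs is left to the synthesis step.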
Each specialized worker internally executes a closed perceive–plan–act–reflect loop: it (1) extracts the goal and any structured parameters (e.g., a device name or a customer UUID); (2) plans a trajectory, decomposing the goal into sub-tasks and selecting tools; (3) issues a tool call via function calling and observes the structured response; (4) optionally reflects and, if the observation is incomplete or inconsistent, revises the plan and iterates; and (5) returns a grounded result. Tools are exposed to the workers through a uniform interface—a Model Context Protocol (MCP) tool layer—which keeps the design robust as tools are added or replaced.
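The five-step loop above can be sketched as a bounded iteration. In this simplified illustration the `plan` callback stands in for the LLM planner and reflection step, and `lookup_device`, the goal format, and the step limit are hypothetical; the production workers use function calling through the MCP tool layer, which is not reproduced here.

```python
from typing import Any, Callable, Optional

Tool = Callable[..., Any]
Planner = Callable[[dict, list], Optional[tuple[str, dict]]]


def run_worker(goal: dict, tools: dict[str, Tool], plan: Planner,
               max_steps: int = 5) -> dict:
    """Bounded plan/act/reflect loop.

    `plan` sees the goal and all observations so far and returns the next
    (tool_name, arguments) pair, or None once it judges the goal satisfied.
    """
    observations: list[dict] = []
    for _ in range(max_steps):
        step = plan(goal, observations)          # plan, reflecting on history
        if step is None:
            break
        name, args = step
        try:
            result = tools[name](**args)         # act: one tool call
            observations.append({"tool": name, "ok": True, "result": result})
        except Exception as exc:                 # keep failures as evidence
            observations.append({"tool": name, "ok": False, "error": repr(exc)})
    return {"goal": goal, "evidence": observations}


def lookup_device(name: str) -> dict:            # hypothetical tool
    return {"name": name, "site": "SITE-A"}


def simple_plan(goal: dict, observations: list) -> Optional[tuple[str, dict]]:
    # One successful lookup is enough for this toy goal.
    if observations:
        return None
    return ("lookup_device", {"name": goal["device"]})


result = run_worker({"device": "edge-router-1"},
                    {"lookup_device": lookup_device}, simple_plan)
print(result["evidence"])
```

Capping the number of steps keeps a worker from looping indefinitely when observations stay incomplete.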
Two cross-cutting mechanisms support long-running, production operation. First, context management: because escalation threads can grow long, the supervisor protects the model’s context window by summarizing prior turns and budgeting the tokens devoted to history, preserving the facts relevant to routing and continuation while discarding noise. Second, robustness: the supervisor detects transient failures and falls back to an alternative worker rather than aborting. A short-term memory holds the conversation and intermediate observations, a retrieval step injects long-term knowledge from the vector store, and guardrails wrap any state-mutating action—which is gated behind validation (Section 3.7.7)—while every interaction is traced for auditability and evaluation (Section 3.8).
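The context-management idea can be illustrated with a small sketch that splits a thread into recent turns kept verbatim and older turns handed to a summarizer. The whitespace token count is a crude stand-in for a real tokenizer, and the function name and budget are hypothetical; the summarization call itself is not shown.

```python
from typing import Callable


def budget_history(turns: list[str], max_tokens: int,
                   count_tokens: Callable[[str], int] = lambda s: len(s.split()),
                   ) -> tuple[list[str], list[str]]:
    """Split history into (older turns to summarize, recent turns kept verbatim).

    Walks backwards from the newest turn and keeps turns until the token
    budget would be exceeded.
    """
    kept: list[str] = []
    used = 0
    for turn in reversed(turns):
        cost = count_tokens(turn)
        if used + cost > max_tokens:
            break
        kept.append(turn)
        used += cost
    kept.reverse()                                  # restore chronological order
    older = turns[: len(turns) - len(kept)]
    return older, kept


thread = [
    "customer reports packet loss on circuit X",
    "probe data shows loss since 10:02",
    "which backbone links are on the path?",
]
to_summarize, verbatim = budget_history(thread, max_tokens=14)
print(to_summarize, verbatim)
```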
Figure 2 summarizes the current supervisor-worker architecture.
Figure 3 details the worker control loop used by each routed agent.

3.4. Model Selection and Deployment

LLM selection is governed by two requirements: reliable function-calling (so the model emits well-formed tool invocations) and data-residency control (so confidential data is processed within an approved trust boundary). The deployment has evolved on both axes (Table 2). The initial system—reported in the conference version [1]—served an open-weight, instruction-tuned model from the Mistral family [34,35], with Llama [36] as an alternative, locally on a GPU cluster for any request touching sensitive data, with a cloud model used only for non-sensitive inference; local serving of capable open-weight models—made practical by accelerated hardware and serving runtimes—is what enabled a private agentic environment. The current production system uses managed Anthropic models served through Amazon Bedrock: a lower-latency tier handles the bulk of interactions and a higher-capability tier handles more demanding reasoning, both operated within the operator’s cloud tenancy. This trades on-premises execution for the throughput, reasoning quality, and operational simplicity of a managed service. In the current managed-cloud configuration, requests are processed within the operator’s approved AWS account, region/inference profile, network controls, and contractual data-processing terms. The us. prefix used by Amazon Bedrock denotes a US cross-region inference profile: Bedrock confines cross-region routing to AWS Regions within the US geography, so that—although request data is stored only in the source Region—input prompts and output results may be processed in another US Region during cross-region inference, under the configured AWS controls [37]; workloads requiring strict on-premises processing remain routed to the local open-weight deployment. The router’s data-sensitivity-aware design—and the option of local open-weight serving—remains available where strict on-premises residency is required.
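The routing rule described in this subsection can be sketched as a small decision function. The three sensitivity classes, the deployment labels, and the `needs_deep_reasoning` flag are hypothetical names introduced for illustration; the source states only that routing depends on data-sensitivity class and that two managed tiers and a local option exist.

```python
from enum import Enum


class Sensitivity(Enum):
    PUBLIC = "public"
    INTERNAL = "internal"
    RESTRICTED = "restricted"      # assumed to require on-premises processing


# Hypothetical deployment labels, not real model identifiers.
LOCAL_OPEN_WEIGHT = "local-open-weight"
CLOUD_FAST = "cloud-low-latency"
CLOUD_CAPABLE = "cloud-high-capability"


def select_deployment(sensitivity: Sensitivity, needs_deep_reasoning: bool) -> str:
    """Choose a deployment by data-sensitivity class first, then by reasoning demand."""
    if sensitivity is Sensitivity.RESTRICTED:
        return LOCAL_OPEN_WEIGHT
    return CLOUD_CAPABLE if needs_deep_reasoning else CLOUD_FAST


print(select_deployment(Sensitivity.INTERNAL, needs_deep_reasoning=False))
print(select_deployment(Sensitivity.RESTRICTED, needs_deep_reasoning=True))
```

Checking the sensitivity class before anything else means that no capability consideration can move restricted data off the local deployment.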

3.5. Tool Integration and Data Sources

The available tool set is the principal determinant of what the agent can accomplish, because the Bot must interface with APIs to obtain the context for each query. In our deployment the tools include: a network source-of-truth/inventory system (NetBox [38]) holding device location, capabilities, and geo-information; monitoring systems including continuously running ping/latency probes; control-plane visibility tools that expose the link-state database (LSDB) via API (e.g., a route-explorer service [39]); AIOps and event-correlation platforms [40]; log aggregation; DDoS protection; and a central configuration-management orchestrator that stores network configuration as YAML files in Git. Real-time telemetry is ingested over Apache Kafka topics, combining data-plane signals—SNMP traps [41] for backbone interface flaps—with control-plane visibility obtained via the BGP Monitoring Protocol (BMP, covering BGP and IGP, with the relevant Address Family Identifiers) [42]. A digital twin of the topology, implemented with the NetworkX graph library, supports Shortest-Path-First (SPF) and Equal-Cost Multi-Path (ECMP) computations used for pre-change validation.
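To make the SPF/ECMP computation concrete, the following sketch runs Dijkstra over a toy topology while keeping every equal-cost predecessor, and then re-runs it after raising one link metric. The deployment uses the NetworkX library; this sketch swaps in a standard-library implementation so that it is self-contained. The four-node topology, the metric values, and the two pass conditions are hypothetical simplifications of a pre-change check.

```python
import heapq

Graph = dict[str, dict[str, int]]


def spf(graph: Graph, source: str) -> tuple[dict[str, int], dict[str, set[str]]]:
    """Dijkstra SPF that records all equal-cost predecessors (ECMP).

    graph[u][v] is the IGP metric of the directed edge u -> v.
    Returns (cost to each reachable node, ECMP predecessors per node).
    """
    dist = {source: 0}
    preds: dict[str, set[str]] = {source: set()}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                      # stale heap entry
        for v, metric in graph.get(u, {}).items():
            nd = d + metric
            if v not in dist or nd < dist[v]:
                dist[v] = nd
                preds[v] = {u}
                heapq.heappush(heap, (nd, v))
            elif nd == dist[v]:
                preds[v].add(u)           # another equal-cost path
    return dist, preds


def with_raised_metric(graph: Graph, a: str, b: str, metric: int) -> Graph:
    """Copy of the topology with the a<->b metric raised in both directions."""
    trial = {u: dict(nbrs) for u, nbrs in graph.items()}
    trial[a][b] = metric
    trial[b][a] = metric
    return trial


# Toy topology with symmetric metrics (hypothetical).
topology: Graph = {
    "A": {"B": 10, "C": 10},
    "B": {"A": 10, "D": 10},
    "C": {"A": 10, "D": 10},
    "D": {"B": 10, "C": 10},
}
dist, preds = spf(topology, "A")

trial = with_raised_metric(topology, "A", "B", 20000)
trial_dist, trial_preds = spf(trial, "A")
all_reachable = set(trial_dist) == set(topology)   # every node still has a path
link_avoided = "A" not in trial_preds["B"]         # A->B traffic uses the backup
print(dist, preds["D"], all_reachable, link_avoided)
```

In the unmodified topology node D is reached over two equal-cost paths; after the metric raise, traffic from A to B shifts to the path through C and D.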

3.6. Retrieval-Augmented Subsystem

Several of the use cases below (Section 3.7) depend on retrieving relevant prior knowledge rather than only querying live APIs. The retrieval-augmented subsystem implements a hybrid, multi-source pipeline. Heterogeneous knowledge sources—historical incident records, operational documentation and runbooks, change and ticket records, and operational chat archives—are each chunked, embedded, and indexed in the vector store (Section 3.2) as a separate per-source collection, preserving source provenance.
Hybrid retrieval. For each query the subsystem computes a dense semantic vector (a managed multilingual embedding model; Appendix A.4) and a sparse lexical vector (a learned sparse expansion, SPLADE [43], for conversational and technical sources, and classical BM25 for structured documentation) and searches every collection on both representations. The dense and sparse rankings are merged by Reciprocal Rank Fusion (RRF) [44], $\mathrm{score}(d) = \sum_{r} 1 / (k + \mathrm{rank}_r(d))$ with $k = 60$, where $r$ ranges over the rankers; RRF is robust to scale differences between rankers and needs no per-source weight tuning. The fused results are aggregated across collections with source tags to maintain diversity.
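The fusion step can be sketched in a few lines. The function below is a generic implementation of the RRF formula quoted above with k = 60; the document identifiers and the two example rankings are hypothetical, and each ranking is assumed to list a document at most once.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[tuple[str, float]]:
    """Reciprocal Rank Fusion: score(d) = sum over rankers of 1 / (k + rank_r(d)).

    Each ranking is a list of document ids, best first; ranks start at 1.
    A document missing from a ranking contributes nothing for that ranker.
    """
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest score first; ties broken by id for a deterministic order.
    return sorted(scores.items(), key=lambda item: (-item[1], item[0]))


dense = ["doc_a", "doc_b", "doc_c"]     # hypothetical dense ranking
sparse = ["doc_b", "doc_d", "doc_a"]    # hypothetical sparse ranking
fused = rrf_fuse([dense, sparse])
for doc_id, score in fused:
    print(f"{doc_id}: {score:.5f}")
```

Because only ranks enter the score, the dense and sparse retrievers never need their raw similarity scores put on a common scale.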
Reranking, enrichment, and synthesis. A neural cross-encoder reranker (Appendix A.4) reorders the fused candidates by query relevance and retains only the top few above a relevance threshold, sharply trimming the context passed downstream. Optionally, a knowledge-graph layer (Section 3.5) augments retrieved incident and ticket items with related entities and relationships that are appended to the context. The cloud model (Section 3.4) then synthesizes a grounded answer with inline, per-source citations; when no collection yields relevant evidence, the subsystem returns an explicit “no relevant information” response rather than an unsupported one. The exact models and search parameters are listed in Appendix A.4.

3.7. Operational Use Cases

The deployed system has grown well beyond its initial handful of capabilities: the supervisor now routes across fourteen specialized domain workers, complemented by dedicated expert agents (change validation and configuration audit) and autonomous fast-path agents (self-healing and network troubleshooting), backed by a large operational tool estate (on the order of two hundred typed tool integrations). Table 3 summarizes this capability set. The current call path is no longer a direct “Bot calls one API” pattern. A request first passes through the context summarizer and supervisor; the supervisor selects one or more workers from Table 3; each worker invokes typed MCP/REST tools; the returned observations are assembled into an evidence package; and an LLM synthesis step produces the grounded response or, for autonomous fast-path agents, hands the action to a policy gate before execution. In this section we describe seven representative use cases—spanning information retrieval, detection, and closed-loop remediation—plus a composite ChatOps aggregator workflow, because these have measured per-event handling times (Section 4); the remaining workers follow the same agentic pattern (Figure 3).

3.7.1. Spare Locator

Node failures, while sometimes transient and resolved by a software change, are in large networks frequently due to faulty hardware that must be replaced. A spare is not always co-located with the failed part, so the engineer must locate the nearest compatible device or an equivalent that supports the required functionality—a time-consuming search across inventory. In the current flow, the supervisor routes the intent to the Inventory & infrastructure worker. That worker calls the inventory/source-of-truth tools to retrieve the failed device’s site, hardware family, line-card or optic requirements, and operational state, then ranks candidate spares using compatibility constraints and geo-distance. The worker returns a grounded evidence package containing the selected spare, location, and rationale, which the synthesis step posts back to the thread. This reduces the task from roughly 900 s to about 10 s per event.
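The ranking step can be illustrated as a filter on compatibility followed by a sort on great-circle distance. This is a simplified sketch: treating compatibility as an exact hardware-family match, the `Spare` fields, and the coordinates are all hypothetical, and the real worker obtains these attributes from the inventory tools.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Spare:
    name: str
    hardware_family: str
    lat: float
    lon: float
    in_service: bool


def haversine_km(lat1: float, lon1: float, lat2: float, lon2: float) -> float:
    """Great-circle distance in kilometres on a spherical Earth."""
    r = 6371.0
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * r * math.asin(min(1.0, math.sqrt(a)))


def rank_spares(failed_family: str, site_lat: float, site_lon: float,
                inventory: list[Spare]) -> list[tuple[Spare, float]]:
    """Keep compatible, idle spares and sort them by distance to the failed site."""
    candidates = [s for s in inventory
                  if s.hardware_family == failed_family and not s.in_service]
    ranked = [(s, haversine_km(site_lat, site_lon, s.lat, s.lon)) for s in candidates]
    return sorted(ranked, key=lambda pair: pair[1])


inventory = [
    Spare("spare-1", "fam-x", 50.11, 8.68, in_service=False),
    Spare("spare-2", "fam-x", 1.35, 103.82, in_service=False),
    Spare("spare-3", "fam-y", 52.37, 4.90, in_service=False),   # wrong family
    Spare("spare-4", "fam-x", 52.37, 4.90, in_service=True),    # not idle
]
for spare, km in rank_spares("fam-x", 51.51, -0.13, inventory):
    print(f"{spare.name}: {km:.0f} km")
```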

3.7.2. Console Information Retrieval

Verifying the operational status of spare or newly racked devices during an incident requires out-of-band console access information. In the current deployment, this is handled by the Device-access management worker rather than by a direct spreadsheet-style lookup. The worker resolves the device identifier, calls the authoritative console-access/source-of-truth tools through the MCP/REST layer, and, where necessary, uses retrieval over indexed access notes to normalize aliases and rack-location descriptions. Secrets are not exposed to the prompt; the response contains only the access metadata permitted for the requesting channel and role. This reduces retrieval time from about 30 s to roughly 3 s.

3.7.3. Packet Loss and Latency Analysis

Latency and packet loss are major incident drivers, with causes ranging from contaminated optics (SFPs) and upstream fiber cuts to interface errors, sub-optimal routing, control-plane issues, and misconfiguration. In a global network, mesh ping probes run continuously between locations, but the raw telemetry and JSON returned by the probe APIs are not directly consumable by an engineer. The supervisor routes these intents to the Network troubleshooting / RCA worker, which calls the probe, monitoring, and AIOps-context tools, aligns measurements to the incident window, and packages the relevant loss, latency, source/destination, and time-range observations. The LLM synthesis step then converts this evidence into a concise operational summary, reducing handling time from approximately 3600 s to about 15 s.

3.7.4. Control-Plane Path Retrieval

During major incidents, control- and data-plane state is decisive. For common service-provider routing protocols, control-plane reachability can be read from a single node via the LSDB, which exposes adjacency changes, neighbor information, k-shortest paths, metrics toward destinations, redundant paths, router isolations, path changes, and flaps. The Network infrastructure worker calls the route-explorer/LSDB-history tools and, where needed, the topology graph/digital-twin tools to compute the requested point-in-time view. The evidence package includes the selected path, alternative paths, IGP metrics, adjacency state, and any isolation or flap indicators; the LLM translates this into an engineer-readable explanation. Because LSDB history is retained for months, the same workflow supports retrospective troubleshooting that is infeasible manually. Path retrieval drops from roughly 3600 s to about 5 s.
Figure 4 summarizes the shared call flow for the four information-retrieval workflows above.

3.7.5. Node-Isolation Detection

Maintenance for patches and fixes is unavoidable and is a primary cause of outages, often through improper router-isolation procedures intended to prevent service impact. Detecting isolation events—before, during, or after maintenance—and promptly paging the right engineers is therefore critical. In the current workflow (Figure 5), the Network troubleshooting / RCA fast-path agent consumes route-explorer and AIOps signals during the maintenance window, calls the route-explorer/LSDB tools to test for isolation, and correlates the result against planned maintenance metadata and service-impact context. If the impact exceeds the plan, the incident-management gate authorizes paging and notification through the service/incident-management tools; otherwise the agent continues monitoring. This reduces reaction time from about 7200 s to roughly 3 s.

3.7.6. Vendor Knowledge-Base Search

Multi-vendor networks reduce operating expenditure and improve resilience by isolating software faults to specific equipment, but they increase troubleshooting complexity and demand vendor-specific expertise. Vendors maintain extensive knowledge bases, support portals, and case histories that describe problem signatures for known issues. In the current flow (Figure 6), the supervisor routes a vendor-related incident to the Vendor support or Vendor insights worker. The worker collects relevant logs and device metadata, identifies the vendor and platform, runs retrieval over indexed vendor knowledge sources and internal case history, and optionally opens or updates a vendor-support/RMA workflow. The synthesis step summarizes likely matches, affected software/hardware, and next actions for the engineer, reducing research time from about 10800 s to roughly 30 s.

3.7.7. Backbone Link-Flapping Detection and Closed-Loop Remediation

Backbone link flaps are a normal occurrence in large regional and global networks, and handling them manually demands a series of slow, error-prone steps: reviewing alerts, logging into multiple systems, running diagnostics or simulations by hand, editing configuration, and pushing changes—typically about 30 minutes per event. This is hard to automate safely because the dampening mechanisms in standard routing protocols lack the intelligence to consider control-plane isolation when reacting to flaps.
Our workflow (Figure 7) closes the loop while preserving safety. Data-plane flap events (SNMP traps) and control-plane state (BMP/BGP+IGP) stream over dedicated Kafka topics and are mapped to a unified LinkEntity datastore. When the observability application detects a threshold breach (e.g., 3 up/down events in < 60  s), the Self-healing autonomous agent receives the source and target nodes and invokes the Network troubleshooting / RCA tools to confirm the event context. The agent then follows a simulation-first procedure: using the digital twin, it issues REST calls that compute SPF/ECMP over the current topology (edges weighted by IGP metric), simulates raising the impacted link’s metric, and verifies that traffic fails over to a backup path without isolating any node (also checking, e.g., whether a node has its overload bit set). Only after the incident-management policy gate confirms that the candidate action is in scope and the simulation returns success does the agent invoke the configuration-management function (via REST) against the central orchestrator, passing the source, target, and a new metric (e.g., static:20000); the orchestrator updates the YAML in Git and commits the change. Operations teams are notified on a dedicated channel. End to end—from alert to validated configuration push and notification—the process completes in roughly 3 minutes, an order-of-magnitude reduction from the manual baseline, while the digital-twin pre-check enforces the guardrail constraint that no change may isolate a node.
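The threshold breach mentioned above (for example, 3 up/down events in under 60 s) can be detected with a sliding window per link. The class below is a minimal sketch under that example threshold; the class name and interface are hypothetical, timestamps are assumed to arrive in non-decreasing order, and the Kafka consumption and LinkEntity mapping are not shown.

```python
from collections import deque


class FlapDetector:
    """Sliding-window counter for up/down transitions on one link."""

    def __init__(self, threshold: int = 3, window_s: float = 60.0) -> None:
        if threshold < 1 or window_s <= 0:
            raise ValueError("threshold must be >= 1 and window_s > 0")
        self.threshold = threshold
        self.window_s = window_s
        self._events: deque[float] = deque()

    def observe(self, timestamp_s: float) -> bool:
        """Record one up/down event; return True when the threshold is breached."""
        self._events.append(timestamp_s)
        # Drop events that are window_s or more older than the current one.
        while timestamp_s - self._events[0] >= self.window_s:
            self._events.popleft()
        return len(self._events) >= self.threshold


detector = FlapDetector()
for ts in (0.0, 20.0, 59.0):
    breached = detector.observe(ts)
    print(ts, breached)
```

A breach would then trigger the simulation-first procedure; the detector itself never changes configuration.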

3.7.8. ChatOps Multi-Tool Aggregator

Finally, we provide a predefined workflow for the case where a network operations team escalates to the backbone/SRE group via chat (Figure 8). When the Bot observes an escalation thread tagged with a customer identifier (UUID), the supervisor dispatches a customer-scoped parallel query across the Service & connection lookup, Service & incident management, Network infrastructure, and Utilities & knowledge base workers. Each worker calls its typed tools and returns observations with source provenance; the synthesis step reasons over the combined result and posts a consolidated, customer-scoped picture directly into the thread. Because several of the use cases above are aggregated behind one trigger, low-level data from many tools is surfaced on a single thread without the engineer visiting multiple dashboards, which can shorten MTTR for complex escalations.
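The parallel dispatch in this workflow can be sketched with asyncio. The worker function is a stand-in that returns an empty observation list, and the worker identifiers and the customer identifier are hypothetical placeholders; a real worker would call its typed tools.

```python
import asyncio


async def query_worker(name: str, customer_id: str) -> dict:
    """Stand-in for one worker; a real worker would call its typed tools."""
    await asyncio.sleep(0)                       # placeholder for I/O latency
    return {"source": name, "customer_id": customer_id, "observations": []}


async def aggregate(customer_id: str, workers: list[str]) -> dict:
    """Fan out to all workers concurrently and keep per-source provenance."""
    results = await asyncio.gather(
        *(query_worker(w, customer_id) for w in workers),
        return_exceptions=True,                  # one failing worker must not sink the rest
    )
    evidence, failures = [], []
    for worker, result in zip(workers, results):
        if isinstance(result, BaseException):
            failures.append({"source": worker, "error": repr(result)})
        else:
            evidence.append(result)
    return {"customer_id": customer_id, "evidence": evidence, "failures": failures}


WORKERS = ["service_connection_lookup", "service_incident_management",
           "network_infrastructure", "utilities_knowledge_base"]
package = asyncio.run(aggregate("example-customer-uuid", WORKERS))
print([item["source"] for item in package["evidence"]])
```

Keeping failures in the returned package lets the synthesis step state which sources were unavailable instead of silently omitting them.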

3.8. Evaluation Methodology

We evaluate the approach along three axes: operational efficiency, observed output-quality signals, and impact on service availability.

3.8.1. Experimental Environment and Data Collection

The system was deployed in a major global interconnection and network environment. We frame this work as a production case study: the evidence is operational telemetry and representative timings drawn from a live deployment rather than the output of a controlled, repeated-measures experiment. Our quantitative results draw only on measured production data: observability traces from Langfuse over a 90-day window, which yield the task frequencies, volumes, token-spend signal, and user-feedback signal; and the per-event handling times for the manual and the agentic procedure, recorded as representative operational timings in the deployment study. We do not introduce assumed or synthetic values. We note plainly, however, that detailed per-task sample sizes, distributions, and confidence intervals were not captured under a controlled repeated-measures design; accordingly, the per-event reductions should be read as operational case-study evidence rather than as the result of a controlled experiment.

3.8.2. Time-Savings Measurement Protocol

For each use case, the manual handling time $\tau_k^{\mathrm{manual}}$ was measured as the time an experienced engineer required to complete the equivalent procedure using the existing tools and dashboards, and the agentic handling time $\tau_k^{\mathrm{auto}}$ as the wall-clock time from issuing the natural-language intent to receiving the grounded response (including tool-call and inference latency). The reduction factor for each task is $\tau_k^{\mathrm{manual}} / \tau_k^{\mathrm{auto}}$. These figures are operational measurements, that is, representative per-event timings from the deployment study, and not the means of a controlled, repeated-trial protocol; per-task sample sizes, distributions, and confidence intervals were not separately recorded, so the resulting reduction factors should be interpreted as operational case-study evidence rather than as statistically characterized effect sizes. We report these per-event times and reduction factors (Table 4) and, separately, the production task frequencies and volumes observed in Langfuse (Section 4.2); we deliberately do not combine them into a single weekly-effort aggregate, because the per-event handling times and the production frequencies were measured over different periods.

3.8.3. Output-Quality Instrumentation

Because efficiency gains are meaningful only if the agent’s outputs are correct and safe, we instrument the deployment with Langfuse [18] for end-to-end tracing and observability and assess quality through two complementary signals. First, LLM-as-a-judge scoring rates responses on dimensions including helpfulness, correctness, hallucination (an inverse-groundedness signal), relevance, context relevance (whether the retrieved passages are on-topic for the query), and conciseness. Second, explicit user feedback (a thumbs-up/down rating captured through the chat interface and recorded via the feedback API) provides a direct, in-the-loop measure of operator-perceived quality. Langfuse additionally records per-trace latency and token usage (prompt and completion tokens); the interaction history is retained internally for audit under the operator’s data-governance policy and is access-controlled. Together these operationalize the responsible-AI principles of explainability (every step is traced and attributable), robustness, and transparency for a production deployment. We apply these judge dimensions both to production traffic and, at scale, to a fixed offline benchmark of 200 operationally realistic questions answered by the production primary model over the live retrieval corpus; the resulting values are reported in Section 4.2 and Section 4.3, respectively. For the offline benchmark, each answer is scored by an LLM-as-a-judge configured within the Langfuse evaluator framework across the six dimensions above: the judge receives the question, the generated answer, and the retrieved passages, and emits a structured per-dimension score in $[0, 1]$. We report the per-dimension mean over the fixed 200-question set together with a 95% confidence interval computed under a normal approximation (per-dimension standard deviations are also reported).
The exact judge model and its decoding configuration are part of the deployment’s evaluation configuration; we treat the judge as an automated, model-based signal rather than human-validated ground truth, and establishing a human-labeled subset to validate and de-bias it remains future work.

3.8.4. Availability Modeling

To connect efficiency to service quality, we model availability with Equation (1) and propagate it along a representative end-to-end service path using a series–parallel reliability formulation (Section 4.4). For tasks that lie on the MTTR critical path, reducing $\tau_k$ reduces MTTR and therefore raises per-service availability for a fixed MTBF.

4. Results

4.1. Operational Efficiency

Table 4 reports, for each use case, the representative per-event operational timing for the manual and the agentic procedure and the resulting reduction factor.
Figure 9 visualizes the comparison on a logarithmic scale.
Per-event reductions span from 10× (console information) to 2400× (node isolation): the agentic procedure shortens the observed handling time for each task from minutes or hours of manual tool navigation down to seconds. The largest per-event reductions accrue to the tasks whose manual procedure requires serial interrogation of many devices or dashboards (node isolation, 2400×; path retrieval, 720×), which the deployed workflow replaces with API-backed lookup or parallelized calls. Beyond these six tasks, the closed-loop backbone link-flapping remediation (Section 3.7.7) reduces per-event handling from approximately 30 minutes to about 3 minutes (a 10× reduction) while additionally reducing the risk of manual misconfiguration through digital-twin validation; the ChatOps aggregator (Section 3.7.8) compounds these savings for multi-tool escalations by surfacing customer-scoped data on a single thread. The production volume and task distribution over which these per-event savings apply are reported in Section 4.2. These reductions measure the end-to-end deployed system; this study does not isolate the incremental contribution of LLM reasoning over deterministic API automation. Even if manual times were 50% lower and agentic times doubled, every factor would shrink only fourfold, so the largest reductions (node isolation, path retrieval) would remain above two orders of magnitude.
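The sensitivity argument above is simple arithmetic on the reduction factor, and a short sketch makes it explicit. Only the three factors quoted in the text are used; the function name and scaling parameters are illustrative choices rather than part of the study's tooling.

```python
# Reduction factors quoted in the text (manual time / agentic time).
reported = {
    "node isolation": 2400.0,
    "path retrieval": 720.0,
    "console information": 10.0,
}


def stressed_factor(factor: float, manual_scale: float = 0.5, agentic_scale: float = 2.0) -> float:
    """Reduction factor after scaling the manual and agentic times.

    factor = t_manual / t_auto, so scaling the numerator by manual_scale and
    the denominator by agentic_scale multiplies it by manual_scale / agentic_scale.
    """
    return factor * manual_scale / agentic_scale


for task, factor in reported.items():
    print(f"{task}: {factor:g}x -> {stressed_factor(factor):g}x")
# node isolation: 2400x -> 600x
# path retrieval: 720x -> 180x
# console information: 10x -> 2.5x
```

Under this pessimistic scenario the two largest factors stay in the hundreds, whereas the smallest (console information) falls to 2.5×, so the robustness claim applies to the large reductions rather than uniformly to every task.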

4.2. Observed Quality Signals

Efficiency is only valuable if the agent’s responses are useful and safe to act on. We therefore report the quality signals that the production deployment exposes through its observability layer, traced with Langfuse (Section 3.8.3) over a representative 90-day window; Table 5 summarizes the measurements. We emphasize at the outset that these are observed signals from production telemetry, not the output of a controlled evaluation: they have not been independently validated against ground truth, and we are careful throughout to distinguish what the signals show from broader claims about correctness or safety that they do not, on their own, support.
Over this window the production ChatOps assistant served on the order of 7,400 interactions, the large majority (about 91%) initiated through the Slack ChatOps channel and the remainder dominated by automated tool invocations (e.g., incident lookups); an additional, far larger volume (approximately $1.3 \times 10^{5}$ interactions) was generated in the pre-production environment during testing, exercising the multi-tool, function-calling path at scale. The principal in-production quality signal is explicit user feedback, a voluntary thumbs-up/down rating. This signal is sparse: only 168 of the ∼7,400 interactions were rated, a response rate of approximately 2.3%. Such a low and self-selected response rate is itself a limitation: raters are not a random sample of interactions, and response (self-selection) bias may skew the rated subset, so the figures below describe the rated responses rather than the full traffic. Of these 168 ratings, 104 were positive and 64 negative, a positive share of 104/168 = 61.9% (mean score 0.62); under a normal approximation the 95% confidence interval for the positive share is approximately [54.5%, 69.2%]. Because this interval lies entirely above 50%, the data support the modest conclusion that a majority of rated responses were judged positive by their raters; they do not establish a broad correctness or safety property, which would require ground-truth labels over a representative sample rather than voluntary ratings. Model-token spend is recorded in the observability layer as prompt and completion tokens per trace, separated by serving tier where applicable; the manuscript reports this token-usage signal rather than a monetary price, because monetary cost is operator- and contract-specific.
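The confidence interval for the positive share can be reproduced from the two reported counts. The sketch below assumes the standard normal-approximation (Wald) interval with z = 1.96; the function name is illustrative.

```python
import math


def wald_interval(successes: int, n: int, z: float = 1.96):
    """Normal-approximation (Wald) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1.0 - p) / n)
    return p, p - half_width, p + half_width


# Counts reported in the text: 104 positive out of 168 rated interactions.
share, lower, upper = wald_interval(104, 168)
print(f"share={share:.3f}, 95% CI=[{lower:.3f}, {upper:.3f}]")
# share=0.619, 95% CI=[0.546, 0.692]
```

The computed bounds agree with the reported interval to within rounding in the first decimal of the lower bound, and the lower bound stays above 0.5, which is the property the argument relies on. For a sample of this size a Wilson interval would be a reasonable alternative, although it is not what the text reports.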
Inspecting the production conversation traces over the same window characterizes the deployed task mix. A majority, approximately 58% (∼4,300 traces), are automated self-healing and monitoring evaluations (link-stability, database-drift, and periodic fabric self-analysis checks), most of which conclude that no action is required. The remaining ∼42% are engineer-initiated operational queries, dominated by incident management (incident search, impact analysis, and status updates; most conversations reference one or more incident records), with latency and link-performance analysis (∼575 traces over the window), path/LSDB-drift investigation (∼300), inventory/spare and out-of-band console lookups (∼75 each), and change validation and vendor TAC/RMA as smaller categories. This production mix is broader and more incident- and self-healing-centric than the six illustrative use cases of Section 3.7, which were chosen to exhibit distinct agent capabilities rather than to mirror production volume; the per-event reduction factors of Table 4 therefore characterize each capability individually, while this distribution reflects how often each is exercised in production.
The incident stream that dominates this production mix originates upstream of the agentic system, in the AIOps event-correlation platform [40] that fuses telemetry into incidents (Section 3.5). Measured platform counts over a representative 30-day window quantify the workload the agent triages (Table 6). In that window the correlation platform processed approximately 1.92 million anomalies together with approximately 716,000 monitoring-system alerts (about 2.64 million raw signals), which it distilled to 7,885 high-impact alerts and 4,382 auto-generated incidents (about 146 per day), a raw-signal-to-incident ratio of roughly 600:1. This ratio is the combined raw signal volume (anomalies plus monitoring-system alerts) divided by incidents created; the high-impact-alert and actionable-event counts are reported as separate, independently measured categories rather than strictly nested funnel stages, so the figure is an order-of-magnitude characterization rather than a single monotonic pipeline. Whether the anomaly and alert streams are de-duplicated against one another before being summed is not established here. It is this stream of incidents, rather than the raw signal volume, that the agentic ChatOps system consumes for triage, impact analysis, and status reporting, consistent with the incident- and self-healing-centric task mix (∼58% automated self-healing/monitoring; incident management dominating the engineer-initiated remainder) observed in the Langfuse traces above. Correlation is what makes this tractable: a single physical fault typically manifests as many concurrent signals, which the platform groups into one incident. In one representative case, six raw events (two interface-down (if_down), two bidirectional-forwarding-detection-down (bfd_down), one log-count anomaly, and one IS-IS-adjacency-down (isis_down)) were correlated into a single incident, sparing the on-call engineer (and the agent) six separate, redundant notifications.
The production user-feedback signal is the first of two complementary quality measures; because it is sparse and self-selected, we pair it with an at-scale offline benchmark, reported next.

4.3. Offline RAG-Quality Benchmark

To assess answer quality at scale and along multiple dimensions, we complement the production feedback with an offline benchmark over a fixed set of $N = 200$ operationally realistic questions drawn from real incident scenarios (for example, correlating interface-down and bidirectional-forwarding-detection events, diagnosing storage-fabric health, and verifying circuit status). Each question is answered end-to-end by the deployed retrieval pipeline (the hybrid dense/sparse retrieval with neural reranking of Section 3.6), using the production primary managed-cloud model tier (Amazon Bedrock; Section 3.4) over the live multi-source corpus, which comprises a customer-relationship system, an IT-service-management/ticketing system, an internal knowledge base, and operational chat archives; each response is then scored by the Langfuse LLM-as-a-judge evaluators (Section 3.8.3) along six dimensions. All 200 queries returned an API response (pipeline completion, not a correctness measure; here a failure would be a request timeout, a tool-call error, an empty retrieval, a malformed answer, or a judge failure), at a median end-to-end latency of 7.0 s and a mean answer length of ≈1,400 characters. On average, 6.7 supporting passages were retrieved per query. This count refers to the passages surfaced and cited across the source collections, whereas the reranker further trims the fused candidates to a smaller top-N before synthesis (Appendix A.4), so the two counts describe different pipeline stages. Table 7 reports the per-dimension mean with a normal-approximation 95% confidence interval over the 200 questions.
The grounded pipeline scores highly on relevance (0.85) and context relevance (0.79), indicating respectively that answers stay on-topic and that retrieval surfaces on-topic passages, with correctness (0.74) and the groundedness/no-hallucination score (0.71) also in the upper range, consistent with multi-source grounding keeping answers tied to retrieved evidence. Conciseness is the clear outlier (0.33): the system favors comprehensive, multi-source synthesis (mean answer length ≈1,400 characters) over brevity, which the judge penalizes. We read this as a genuine, actionable finding rather than an artifact hidden by averaging: answer length can be tuned at the prompt and UX level, without retraining.
Four caveats bound these numbers. First, this is an offline benchmark over a curated question set and is therefore distinct from, and complementary to, the production traffic of Section 4.2; the two reflect different populations. Second, the scores are produced by an LLM-as-a-judge rather than by human-labeled ground truth, and so inherit the judge model’s biases; they are best read as relative, dimension-level indicators rather than as absolute accuracy. Third, the benchmark exercises the primary high-volume managed-cloud tier; the reasoning-tier model and the local open-weight configuration were not separately benchmarked. Fourth, the 200 questions were drawn from real operational incident scenarios and phrased as natural-language queries; their exact sampling frame, stratification by task type, and identifier anonymization are part of the (proprietary) evaluation-set construction and are not released, and we did not separately audit for temporal leakage between a question and corpus documents created after the underlying incident. A paired comparison against a non-grounded baseline and a human-rated subset remain natural next steps (Section 6).

4.4. Service Availability Impact

For the tasks that lie on the MTTR critical path, the time reductions in Table 4 propagate to service availability through Equation (1). To make this concrete, consider a representative layer-2 production service connecting two major locations, London and Singapore (Figure 10). The path comprises edge and backbone components in series, with redundant spine pairs in parallel at each metro.
The end-to-end availability (excluding link contributions) composes the series elements and the parallel spine pairs as
$$A_{\mathrm{service}} = A_{\mathrm{Lon,SE}} \cdot \left[ 1 - (1 - A_{\mathrm{Lon,SP1}})(1 - A_{\mathrm{Lon,SP2}}) \right] \cdot A_{\mathrm{Lon,BB}} \cdot A_{\mathrm{Sin,BB}} \cdot \left[ 1 - (1 - A_{\mathrm{Sin,SP1}})(1 - A_{\mathrm{Sin,SP2}}) \right] \cdot A_{\mathrm{Sin,SE}}, \tag{2}$$
where $A_x$ is the availability of component $x$ (SE: service edge; SP: spine; BB: backbone). Equation (2) shows that each component’s availability multiplies into the end-to-end result; reducing any component’s MTTR therefore raises $A_{\mathrm{service}}$.
Equation (2) is a structural model: producing a numerical availability requires operator-specific per-component MTBF and MTTR values, which are not available from the transaction data used in this study, so we do not report a specific availability figure. The qualitative consequence is nonetheless monotonic. Each component’s availability, MTBF / (MTBF + MTTR) (Equation (1)), increases as MTTR falls for a fixed MTBF, and the end-to-end expression in Equation (2) is non-decreasing in every component availability. The measured per-event time reductions (Section 4.1), insofar as they shorten the diagnostic portion of MTTR on the critical path, therefore raise the affected components’ availability and hence the end-to-end service availability.
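The monotonicity argument can be checked numerically with a small sketch of the series–parallel model. The MTBF and MTTR values below are purely hypothetical placeholders chosen to exercise the formula; they are not measurements from the deployment, and no availability figure should be read from them.

```python
def availability(mtbf_h: float, mttr_h: float) -> float:
    """Steady-state availability of one component: MTBF / (MTBF + MTTR)."""
    return mtbf_h / (mtbf_h + mttr_h)


def parallel(a1: float, a2: float) -> float:
    """Availability of a redundant pair: up unless both members are down."""
    return 1.0 - (1.0 - a1) * (1.0 - a2)


def service_availability(a: dict) -> float:
    """Series composition of edge, redundant spine pair, and backbone per metro."""
    return (
        a["Lon,SE"]
        * parallel(a["Lon,SP1"], a["Lon,SP2"])
        * a["Lon,BB"]
        * a["Sin,BB"]
        * parallel(a["Sin,SP1"], a["Sin,SP2"])
        * a["Sin,SE"]
    )


COMPONENTS = [
    "Lon,SE", "Lon,SP1", "Lon,SP2", "Lon,BB",
    "Sin,BB", "Sin,SP1", "Sin,SP2", "Sin,SE",
]
HYPOTHETICAL_MTBF_H = 10_000.0  # placeholder, not an operator value


def path_availability(mttr_h: float) -> float:
    # Same (hypothetical) MTBF and MTTR for every component, for simplicity.
    per_component = {c: availability(HYPOTHETICAL_MTBF_H, mttr_h) for c in COMPONENTS}
    return service_availability(per_component)


before = path_availability(4.0)  # hypothetical MTTR of 4 h
after = path_availability(3.5)   # hypothetical MTTR shortened by 30 min
print(after > before)
```

Because each spine pair is redundant, its contribution to unavailability is second-order, so in this model an MTTR reduction has the most leverage on the series elements (service edge and backbone).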
In summary, by reducing the time engineers spend on spare search, console-information retrieval, packet-loss/latency analysis, path retrieval, node-isolation verification, and vendor research, the approach can lower MTTR when these tasks are on the incident critical path and thereby support modeled improvements in end-to-end service availability, with direct consequences for SLA compliance.

5. Discussion

5.1. Principal Findings

The central result is that intent-driven, agentic execution is associated with per-event wall-clock times one to three orders of magnitude lower for common network-operations tasks (10× to 2400× across the six recurring tasks studied). Two mechanisms plausibly explain the magnitude. First, much of the manual cost is navigation and translation overhead (logging into the right tool, locating the relevant datum, and mentally converting raw API/JSON output into an operational picture) rather than irreducible cognitive work; the agent collapses this overhead by calling the API directly and using the LLM as a translation layer. Second, the tasks with the largest reductions (node isolation, 2400×; path retrieval, 720×) are precisely those whose manual procedure requires serial interrogation of many devices or many dashboards, which the deployed workflow parallelizes or replaces with an API-backed lookup against a system of record. These per-event reductions apply across a production workload that, over the 90-day observation window, comprised roughly 7,400 interactions dominated by incident management and automated self-healing checks (Section 4.2); the operational impact is therefore largest for the high-volume, high-per-event-cost tasks.
For tasks that lie on the incident (MTTR) critical path, the time savings support modeled availability gains under a fixed-MTBF assumption, through the availability model of Section 4.4. The practical implication is that agentic ChatOps is not merely an engineer-convenience feature but, for those tasks, a potential lever on SLA-relevant service quality.

5.2. Comparison with Related Work

Relative to the systems surveyed in Section 2, our contribution is distinguished less by any single technique than by their integration into a deployed, carrier-scale system. Incident-management copilots such as Nissist [29] and query-recommendation systems such as Xpert [30] focus on a single phase of the workflow; RCA agents such as RCAgent [27] and the agents explored by Roy et al. [32] target diagnosis; and NetLLM [23] adapts a model to networking tasks but is evaluated offline. Our system spans information retrieval, detection, and closed-loop remediation that pushes a verified configuration change, and it does so with data-residency-aware model deployment—two capabilities that, to our knowledge, are not jointly demonstrated in prior production-oriented work (Table 1). The digital-twin pre-check that gates configuration changes is, in particular, a concrete instantiation of the guardrail constraint C from Section 3.1 and a differentiator from intent-management proposals that lack a validated safety envelope.

5.3. Security, Privacy, and Responsible AI

Operating an LLM agent inside a carrier network raises data-governance concerns that shaped the architecture. The intelligent query router classifies each request by data sensitivity and directs it to an appropriate model deployment. In the initial system this meant serving sensitive requests from a locally hosted open-weight model so that confidential content never left the operator’s premises; the current system instead uses managed cloud models (Section 3.4): in this configuration, requests are processed within the operator’s approved AWS account, region/inference profile, network controls, and contractual data-processing terms for managed Anthropic models on Amazon Bedrock—a deliberate trade of on-premises control for managed-service capability, with workloads requiring strict on-premises processing remaining routed to the local open-weight deployment. Credentials are held in a secrets-encryption platform rather than embedded in prompts, code, or logs, which limits the blast radius of prompt-leakage or log-exfiltration attacks. Every interaction is traced and retained for audit, and a hallucination/groundedness judge signal flags unsupported responses. Collectively, these measures operationalize responsible-AI principles—explainability (every step is logged and attributable), robustness, and transparency—for a production deployment.
Several residual risks warrant attention. Prompt injection through tool outputs (e.g., a crafted log line or a malicious knowledge-base article) could attempt to subvert the agent’s plan; mitigations include treating tool output as untrusted data, constraining the tool schema, and validating any state-changing action against an independent check. Over-broad tool permissions are another risk: the agent should operate under least-privilege API scopes, and write-capable tools (configuration management) should be reachable only through the validated remediation path. The control set currently deployed or available in the system reflects these concerns. Concretely: (i) tools are exposed to the agent through an explicit allow-list, and API scopes are partitioned by least privilege so that read and write capabilities are granted separately rather than as a single broad credential; (ii) tool output is treated as untrusted input and is parsed against a fixed response schema before it can influence the agent’s plan; (iii) the human-approval gate applies to the general change-management path—candidate changes are raised as change requests and are never auto-submitted, requiring change-advisory-board or explicit operator approval before any such write—whereas the narrow link-flap self-healing action (raising the affected link’s IS-IS metric to shift traffic off it) executes autonomously once it passes the digital-twin SPF/ECMP pre-check, gated by that simulation rather than by a human, and bounded by notification to an operations channel and a version-controlled rollback; and (iv) configuration is version-controlled with an audit log retained for every interaction, providing attribution and a revert point. We additionally treat systematic prompt-injection red-teaming—adversarial probing of tool outputs and retrieved content—as an ongoing, planned exercise rather than a completed evaluation, and we do not yet report quantitative robustness results from it.
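Controls (i) and (ii) above can be illustrated with a minimal sketch of allow-listing plus strict schema validation of tool output. The tool names, the schema, and the exception type are hypothetical; the deployed system's actual schemas and validation library are not described in the text.

```python
from typing import Any

# Hypothetical allow-list of tools the agent may call.
ALLOWED_TOOLS = {"incident_lookup", "path_query"}

# Hypothetical fixed response schema: field name -> required type.
INCIDENT_SCHEMA = {"incident_id": str, "severity": int, "summary": str}


class ToolOutputError(ValueError):
    """Raised when tool output fails the allow-list or schema check."""


def validate_tool_output(tool: str, payload: dict, schema: dict) -> dict:
    """Return a cleaned copy of payload, or raise ToolOutputError."""
    if tool not in ALLOWED_TOOLS:
        raise ToolOutputError(f"tool not allow-listed: {tool}")
    # Reject unexpected fields outright instead of passing them to the
    # planner, since free-form extra fields are a natural injection vector.
    unexpected = set(payload) - set(schema)
    if unexpected:
        raise ToolOutputError(f"unexpected fields: {sorted(unexpected)}")
    clean: dict[str, Any] = {}
    for field, expected_type in schema.items():
        value = payload.get(field)
        # Exact type match, so that e.g. True is not accepted as an int.
        if type(value) is not expected_type:
            raise ToolOutputError(f"field {field!r} missing or wrong type")
        clean[field] = value
    return clean


ok = validate_tool_output(
    "incident_lookup",
    {"incident_id": "INC-1", "severity": 2, "summary": "link down"},
    INCIDENT_SCHEMA,
)
print(ok)
```

Schema validation constrains the structure of what reaches the agent's plan but not the text inside free-form string fields such as `summary`, which is one reason the independent check on state-changing actions remains necessary.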

5.4. Reliability and Safety of Autonomous Actions

The link-flapping workflow (Section 3.7.7) embodies the safety philosophy of the system: simulate before acting. Before any metric change is committed, the digital twin computes SPF/ECMP over the live topology and verifies that the candidate change neither isolates a node nor removes the only viable backup path. This shifts the agent from an open-loop assistant to a closed-loop controller with a verifiable safety envelope, and it confines autonomy to changes that have passed an explicit pre-check. Two design choices reinforce reliability: configuration is stored as version-controlled YAML in Git, giving every change an audit trail and a revert point; and notifications are posted to a dedicated operations channel so that humans retain situational awareness. For higher-risk or ambiguous actions, a human-in-the-loop approval gate can be inserted at the act step of Figure 3 without altering the rest of the design. We regard robust automated rollback—failover if a post-change anomaly is detected—as an important next step (Section 6).
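To make the simulate-before-acting idea concrete, the following is a deliberately simplified sketch of a pre-check for raising one link's metric. It is not the deployed digital twin: it assumes an undirected topology with symmetric metrics, ignores ECMP load sharing and capacity, and uses hypothetical function names. It approves the change only if the topology stays connected with the flapping link treated as unusable and if the link's endpoints would prefer another path once the metric is raised.

```python
import heapq


def build_adj(links: dict) -> dict:
    """Build an undirected adjacency map from {(a, b): metric}."""
    adj: dict = {}
    for (a, b), metric in links.items():
        adj.setdefault(a, {})[b] = metric
        adj.setdefault(b, {})[a] = metric
    return adj


def spf(adj: dict, source: str) -> dict:
    """Dijkstra shortest-path costs from source to every reachable node."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, node = heapq.heappop(heap)
        if d > dist.get(node, float("inf")):
            continue  # stale heap entry
        for nbr, metric in adj.get(node, {}).items():
            nd = d + metric
            if nd < dist.get(nbr, float("inf")):
                dist[nbr] = nd
                heapq.heappush(heap, (nd, nbr))
    return dist


def precheck_metric_raise(links: dict, link: tuple, new_metric: int) -> bool:
    """Approve raising `link` (a key of `links`) to `new_metric`, or reject."""
    nodes = {n for pair in links for n in pair}
    # Check 1: worst case, the flapping link carries no traffic at all.
    # Every node must still be reachable, i.e. no node is isolated.
    adj_without = build_adj({k: v for k, v in links.items() if k != link})
    root = next(iter(nodes))
    if set(spf(adj_without, root)) != nodes:
        return False
    # Check 2: with the raised metric, the endpoints must have a strictly
    # cheaper alternative, so that traffic actually shifts off the link.
    raised = dict(links)
    raised[link] = new_metric
    a, b = link
    return spf(build_adj(raised), a)[b] < new_metric


triangle = {("A", "B"): 10, ("B", "C"): 10, ("A", "C"): 10}
chain = {("A", "B"): 10, ("B", "C"): 10}
print(precheck_metric_raise(triangle, ("A", "B"), 1000))  # True
print(precheck_metric_raise(chain, ("A", "B"), 1000))     # False
```

In the triangle, the path through C absorbs the traffic, so the change passes; in the chain, the link is the only connection to node A, so the change is rejected and the decision would fall back to a human.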

5.5. Generalizability

Although evaluated in one global interconnection environment, the design is deliberately tool-agnostic: the agent interacts with capabilities through typed function calls, and the specific backends (inventory, monitoring, route explorer, configuration manager) are interchangeable behind those interfaces. The seven detailed use cases were chosen to span the operational space—inventory lookup, RAG over a source of truth, telemetry summarization, control-plane introspection, event detection, knowledge-base retrieval, and closed-loop remediation—so that the patterns transfer to other carrier and large-enterprise networks even where individual tools differ. The principal porting effort is integration engineering (API adapters, embeddings of the local source of truth) rather than redesign, which is consistent with our observation in Section 2 that current agentic frameworks are powerful but not “plug-and-play” for a given vertical.

5.6. Token Usage and Operational Considerations

The two deployment modes carry different operational profiles. Hosting an open-weight model on a GPU cluster incurs capital and energy overhead but eliminates managed-service token charges and data-egress risk; a managed cloud service (the current deployment) avoids the GPU footprint but makes token spend a first-class operating signal. In practice, the observability layer records prompt and completion tokens for each trace (Section 4.2), allowing the operator to attribute token usage by channel, task class, and serving tier. This is more portable than reporting a monetary price, because cost depends on the operator’s contract, region, model mix, and discount structure. Set against the operational benefit—per-event time reductions of one to three orders of magnitude (Section 4.1) applied across a production workload of thousands of interactions per quarter (Section 4.2)—token spend remains the measurable quantity to monitor when deciding which requests should use a local model, a high-volume managed tier, or a higher-capability reasoning tier. A full total-cost-of-ownership analysis (GPU amortization or managed-service token fees, energy, and serving overhead versus reclaimed labor and SLA credits avoided) is operator-specific and a useful direction for future quantification.

5.7. Limitations and Threats to Validity

Several limitations qualify these results. No automation baseline: the comparison is manual versus agentic handling, and therefore measures the end-to-end benefit of the deployed system as a whole; it does not isolate the contribution of LLM-based agentic reasoning from that of ordinary deterministic automation (scripts, API orchestration, or robotic process automation) that could perform parts of the same procedures. Consequently, the per-event reductions are most defensibly attributed to an integrated ChatOps automation system with an LLM-mediated interface rather than to LLM agency in isolation; an ablation against a scripted, non-LLM automation baseline would be required to separate these contributions and is left as future work. Construct and internal validity: the manual handling times are representative operational measurements rather than the output of a controlled, repeated-measures experiment, so they carry variance that the present study does not fully characterize. In particular, per-task timing statistics (sample size n, median, interquartile range, and confidence intervals) and warm- versus cold-cache effects were not recorded, and failed runs, retries, and hallucination incidents were not separately accounted for in the timing; a pre-registered protocol with multiple operators, repeated trials per task, explicit cache-state and failure/retry accounting, and reported confidence intervals would tighten the estimates and is recommended as a future controlled evaluation. Modeled (not measured) availability: the availability consequence is modeled via Equation (1) and the series–parallel formulation (Section 4.4), not measured; we do not report a measured incident MTTR or a before-and-after availability figure, so the availability claim is a structural inference from the measured time savings rather than an observed outcome.
External validity: the data derive from a single (large) operator, and frequencies and times reflect that environment’s tooling and traffic; other networks may differ. LLM-specific threats: LLM outputs are non-deterministic and can hallucinate [20]; while grounding, retrieval, and the Langfuse LLM-as-a-judge checks mitigate this, they do not eliminate it. Evaluation maturity: we report two complementary quality signals—the sparse, self-selected production user-feedback ratings (Section 4.2) and an at-scale offline LLM-as-a-judge benchmark over 200 questions with confidence intervals (Section 4.3). Both carry limits: the user feedback is voluntary and non-representative, while the benchmark scores are model-generated (LLM-as-a-judge) rather than human-validated, cover only the primary model, and are computed over a curated question set rather than production traffic. A paired comparison against a non-grounded baseline and a human-labeled ground-truth subset therefore remain future work. Benchmarking: there is no public benchmark for carrier-network agentic operations, which limits direct comparison with prior systems and motivates community efforts toward shared, privacy-preserving evaluation datasets. Safety coverage: the digital-twin pre-check validates against modeled topology state and can be wrong if the model diverges from reality (e.g., un-modeled state or simulation inaccuracies); conservative thresholds, post-change verification, and rollback reduce but do not remove this risk. Moreover, this article describes the safety mechanism but does not statistically evaluate safety outcomes: counts of autonomous remediations attempted, candidate changes accepted versus rejected by the digital-twin pre-check, rollbacks, and post-change anomalies are not reported, and a quantitative safety evaluation is left to future work. We return to these in the future-work discussion.

6. Conclusions

Large-scale network operations are constrained by tool sprawl and by the manual effort required to navigate, query, and correlate across many heterogeneous systems—effort that lengthens MTTR and erodes service availability. We presented an agentic ChatOps system that reframes these tasks as intent-driven workflows: a ChatOps front end and a supervisor of specialized LLM agents plan, call tools through function calling, retrieve from a network source of truth, and—behind a digital-twin safety check—close the loop on remediation. The architecture pairs an intelligent query router—directing requests by data-sensitivity class across a locally hosted open-weight model and managed cloud models—with end-to-end Langfuse observability.
Evaluated across seven representative use cases in a major global network, within a deployed set of fourteen specialized workers and dedicated expert and autonomous agents, the approach was associated with per-event handling times lower by factors of about 10× to 2400× and shortened the backbone link-flap remediation cycle from roughly 30 to 3 minutes while reducing exposure to manual configuration edits through digital-twin validation and version-controlled change. Over a 90-day production window it handled roughly 7,400 ChatOps interactions, dominated by incident management and automated self-healing checks, with positive user feedback on the majority of rated responses. For tasks that lie on the incident critical path, the savings support modeled gains in end-to-end service availability (under a fixed-MTBF assumption and a series–parallel availability model), with corresponding modeled implications for SLA compliance. More broadly, the results indicate that the integrated agentic ChatOps automation system, which combines deterministic elements (predefined workflows, digital-twin validation, version-controlled configuration) with generative reasoning, helps bridge the operational gap between complex API-driven systems and engineers, in a deployment designed to preserve both confidentiality under the configured trust boundary and the operational-safety properties that carrier environments demand.
Future work. Several directions follow naturally. (i) Tool determination via function calling: exposing each capability as a distinct callable tool and letting the LLM contextually select the appropriate API—or combination of APIs—for complex, multi-step tasks, reducing the reliance on hand-authored workflows. (ii) Further long-context techniques: building on the current summarization-and-token-budgeting scheme, learned message-importance re-ranking to better preserve salient context during very long escalations. (iii) Robust rollback and post-change verification: automated failover if a post-change anomaly is detected, closing the safety loop beyond the pre-change simulation. (iv) Broader observability and simulation coverage: integrating additional telemetry sources and richer digital-twin models to handle a wider range of network scenarios and to bound simulation inaccuracy. (v) Rigorous and shared evaluation: building on the offline LLM-as-a-judge benchmark reported here (Section 4.3), a human-labeled ground-truth subset, a paired comparison against a non-grounded baseline, a controlled multi-operator timing protocol, and progress toward a privacy-preserving public benchmark for agentic network operations would further strengthen comparability across systems. Pursuing these directions would extend the present system from a high-impact operational tool toward a general, safety-oriented, and verifiable substrate for autonomous network operations.

Supplementary Materials

No supplementary materials are provided.

Author Contributions

Conceptualization, F.P., I.K., U.S., L.I., A.H. and H.K.; methodology, F.P.; software, F.P., I.K., U.S. and L.I.; validation, F.P., E.H., I.K., U.S., L.I., A.H. and H.K.; formal analysis, F.P.; investigation, F.P.; resources, I.K., U.S., L.I., A.H. and H.K.; data curation, F.P.; writing—original draft preparation, F.P.; writing—review and editing, E.H., I.K., U.S., L.I., A.H. and H.K.; visualization, F.P.; supervision, E.H. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was treated as non-human-subjects research under the operator’s data-governance and acceptable-use policies; no formal Institutional Review Board determination was sought. The work analyzes anonymized operational quality telemetry—binary thumbs-up/thumbs-down ratings of assistant responses—generated incidentally during normal use of an internal production tool and collected under those policies. No personal data, demographic attributes, or identifying information were collected, and no interventions or interactions were undertaken with human participants for research purposes.

Data Availability Statement

The operational data analyzed in this study were collected from a production carrier network and, together with the production source code of the deployed system, are proprietary and confidential; they are subject to commercial, security, and customer-privacy restrictions and therefore cannot be made publicly available or shared. To support reproducibility, representative artifacts—illustrative tool schemas, illustrative agent-prompt skeletons, the digital-twin validation logic, and the software stack and model-serving configuration used—are provided in Appendix A, and the aggregate figures and statistics underpinning the reported results are presented within the article. Further aggregated, non-confidential data may be made available from the corresponding author on reasonable request, subject to the operator’s data-governance policies.

Acknowledgments

The authors thank the network operations and reliability engineering teams whose workflows informed the use cases described in this work. During the preparation of this manuscript, the authors used generative-AI tools to assist with drafting and language editing of the text and with the preparation of figures. The authors reviewed and edited all such content and take full responsibility for the content and conclusions of this publication.

Conflicts of Interest

I.K., U.S., L.I., A.H. and H.K. are employed by Equinix. The remaining authors declare no conflicts of interest. The funders had no role in the design of the study; in the collection, analyses, or interpretation of data; in the writing of the manuscript; or in the decision to publish the results.

Abbreviations

    The following abbreviations are used in this manuscript:
AIOps Artificial Intelligence for IT Operations
API Application Programming Interface
BB Backbone
BGP Border Gateway Protocol
BMP BGP Monitoring Protocol
ChatOps Chat-driven Operations
ECMP Equal-Cost Multi-Path
IGP Interior Gateway Protocol
LLM Large Language Model
LSDB Link-State Database
MTBF Mean Time Between Failures
MTTR Mean Time to Repair
MCP Model Context Protocol
NMS Network Management System
NOC Network Operations Center
RAG Retrieval-Augmented Generation
RCA Root-Cause Analysis
REST Representational State Transfer
SE Service Edge
SFP Small Form-factor Pluggable
SLA Service Level Agreement
SNMP Simple Network Management Protocol
SP Spine
SPF Shortest Path First
UUID Universally Unique Identifier
YAML YAML Ain’t Markup Language

Appendix A. Reproducibility

This appendix documents the software stack, model configuration, and the core procedures behind the reported results, so that the design can be reproduced with equivalent open components. To respect the operator’s data-governance policy, all schemas, identifiers, and field names below are anonymized and representative: they preserve the structure of the production artifacts without disclosing internal systems, devices, or customer data.

Appendix A.1. Software Stack and Model-Serving Configuration

Table A1 lists the principal components and their roles without exposing package versions or cloud model identifiers. The agent runtime is built on LangChain/LangGraph with Redis-backed checkpointing for durable conversational state; tools are exposed to the workers through Model Context Protocol (MCP) adapters; Qdrant serves the retrieval index, Neo4j holds the agent-graph state, and the topology digital twin is implemented with NetworkX. Observability is captured through OpenTelemetry instrumentation and Langfuse tracing. The initial (conference [1]) deployment served a local open-weight model through Ollama; the current production deployment serves managed Anthropic models through Amazon Bedrock US cross-region inference profiles [37], with a lower-latency tier handling the bulk of traffic and a higher-capability tier handling harder reasoning. The us. prefix used by Bedrock denotes a US cross-region inference profile, so cross-region routing is confined to United States regions.
Table A1. Software stack and model-serving configuration for the agentic ChatOps deployment. Exact package versions and cloud model identifiers are omitted for security and operational-governance reasons; the table records the component roles needed to reproduce the design with equivalent technologies.
Component | Configuration | Role
LangChain | Python framework | Agent/LLM orchestration framework
LangGraph | Stateful agent graph | Supervisor/worker routing and execution graph
langgraph-checkpoint-redis | Redis-backed checkpointing | Durable checkpointing of agent state
langchain-mcp-adapters | MCP adapter layer | MCP tool-layer integration
mcp | Model Context Protocol runtime | Typed tool interface for workers
Qdrant (qdrant_client) | Vector database client | Vector store for the RAG subsystem
Neo4j (neo4j) | Graph database client | Graph-structured agent state
Redis | In-memory data store | Checkpoint / short-term memory backend
FastAPI | API framework | Service / feedback API
boto3 | AWS SDK client | Amazon Bedrock client
opentelemetry-instr.-langchain | Tracing instrumentation | LangChain/LangGraph telemetry hooks
Langfuse [18] | Self-hosted trace and evaluation server | Trace store, judge and feedback scoring
NetworkX | Python graph library | SPF/ECMP topology digital twin
Models
Initial (local; sensitive data) | Open-weight instruction-tuned model served through Ollama [35] | Local GPU serving for confidential requests
Current (cloud; bulk) | Managed Anthropic model through Amazon Bedrock | High-volume interactions within a US cross-region inference profile
Current (cloud; reasoning) | Managed Anthropic model through Amazon Bedrock | Complex reasoning within a US cross-region inference profile

Appendix A.2. Supervisor Routing Schema

The supervisor (Section 3.3) is an LLM constrained to structured output. For each natural-language intent it emits a routing decision with three fields: a free-text routing rationale that records why the intent maps to a given domain; a confidence score in [0, 1]; and one or more selected workers drawn from the fixed set of specialized agents. When the intent spans several domains, the supervisor returns multiple workers and they are dispatched in parallel, after which a synthesis step merges their outputs into a single grounded response. A confidence threshold θ gates the decision: routes with confidence at or above θ proceed automatically, whereas low-confidence (ambiguous) routes are not auto-dispatched; the supervisor instead requests a clarification or falls back to a conservative default worker. Schematically, the structured output is
{ rationale: string,
  confidence: float [0..1],
  workers: [ {name: enum, args: object}, ... ] }
and routing applies the guard auto-dispatch ⇔ confidence ≥ θ, with ambiguous intents escalated for clarification.
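The routing guard can be sketched in a few lines of Python. This is a minimal illustration rather than the deployed supervisor: the worker names, the default worker, and the threshold value 0.7 are hypothetical placeholders (the value of θ is not stated here), and only the three-field structure and the confidence comparison come from the schema above.

```python
from dataclasses import dataclass, field
from typing import Any

# Hypothetical worker set; the production set of workers is operator-specific.
WORKERS = {"inventory", "monitoring", "control_plane", "incident"}
DEFAULT_WORKER = "inventory"  # assumed conservative fallback worker
THETA = 0.7                   # assumed threshold; the real value is not given


@dataclass
class WorkerCall:
    name: str
    args: dict[str, Any] = field(default_factory=dict)


@dataclass
class RoutingDecision:
    rationale: str
    confidence: float
    workers: list[WorkerCall]


def dispatch_plan(decision: RoutingDecision, theta: float = THETA) -> dict[str, Any]:
    """Apply the guard: auto-dispatch only when confidence >= theta."""
    if not 0.0 <= decision.confidence <= 1.0:
        raise ValueError("confidence must lie in [0, 1]")
    unknown = [w.name for w in decision.workers if w.name not in WORKERS]
    if unknown:
        # The supervisor must not invent workers outside the fixed set.
        raise ValueError(f"unknown workers: {unknown}")
    if decision.workers and decision.confidence >= theta:
        return {"action": "dispatch", "workers": [w.name for w in decision.workers]}
    # Ambiguous intent: ask for clarification, with a conservative fallback.
    return {"action": "clarify", "fallback": DEFAULT_WORKER}
```

A multi-domain intent with confidence 0.92 would be dispatched to all selected workers, whereas the same intent at confidence 0.3 would yield the clarification branch.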

Appendix A.3. Representative Tool Schema (MCP)

Each worker invokes tools through the MCP tool layer, which presents every backend—inventory, monitoring, control-plane, configuration management, incident management—behind a uniform, typed, function-calling interface. A tool is described to the model by a name, a natural-language description, and a JSON-schema parameter specification; the model emits a structured tool call, and the layer returns a structured observation. A representative (anonymized) read-only tool for control-plane path retrieval is:
name: get_control_plane_paths
description: “Return current shortest/ECMP paths and
              adjacency state between two nodes.”
parameters: { source_node: string (required),
              target_node: string (required),
              k_paths: integer (default 3),
              as_of: timestamp (optional) }
returns: { paths: [ {hops: [string], metric: int} ],
           isolated_nodes: [string], backup_available: bool }
State-mutating tools (configuration management) follow the same schema form but are reachable only through the validated remediation path of Appendix A.5, and operate under least-privilege API scopes.
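To make the schema form concrete, the read-only tool above can be written as a JSON-Schema-style descriptor with a small argument check applied before the call reaches a backend. This is a sketch under assumptions: the dictionary layout follows common function-calling conventions rather than the production MCP definition, and `prepare_call` is a hypothetical helper, not part of any library.

```python
from typing import Any

# JSON-Schema-style descriptor mirroring the anonymized tool (layout assumed).
GET_CONTROL_PLANE_PATHS: dict[str, Any] = {
    "name": "get_control_plane_paths",
    "description": "Return current shortest/ECMP paths and "
                   "adjacency state between two nodes.",
    "parameters": {
        "type": "object",
        "properties": {
            "source_node": {"type": "string"},
            "target_node": {"type": "string"},
            "k_paths": {"type": "integer", "default": 3},
            "as_of": {"type": "string", "format": "date-time"},
        },
        "required": ["source_node", "target_node"],
    },
}

_PY_TYPES = {"string": str, "integer": int}


def prepare_call(tool: dict[str, Any], args: dict[str, Any]) -> dict[str, Any]:
    """Check a model-emitted tool call against the schema and fill defaults."""
    spec = tool["parameters"]
    props = spec["properties"]
    missing = [p for p in spec["required"] if p not in args]
    if missing:
        raise ValueError(f"missing required parameters: {missing}")
    extra = [a for a in args if a not in props]
    if extra:
        raise ValueError(f"unexpected parameters: {extra}")
    for key, value in args.items():
        expected = _PY_TYPES[props[key]["type"]]
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, expected):
            raise TypeError(f"{key} must be of type {props[key]['type']}")
    defaults = {k: v["default"] for k, v in props.items() if "default" in v}
    return {**defaults, **args}
```

Rejecting unknown or mistyped arguments before dispatch keeps a malformed model output from reaching the backend, which is one plausible way to realize the "typed" property of the tool layer.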

Appendix A.4. Retrieval Pipeline Configuration

The retrieval-augmented subsystem (Section 3.6) is a hybrid pipeline configured as follows; each knowledge source is indexed as a separate vector-store collection with a dense and a sparse vector per chunk.
  • Dense embeddings: a managed multilingual embedding model served through Amazon Bedrock [45,46].
  • Sparse vectors: a learned sparse expansion model (SPLADE [43], naver/splade-cocondenser-ensembledistil) for conversational and technical sources, and BM25 for structured documentation.
  • Fusion: Reciprocal Rank Fusion [44] with k = 60, with up to 50 candidates retrieved per collection.
  • Reranking: a managed neural reranker served through Amazon Bedrock [47] reduces the fused candidates to a top-N (5 in the production configuration), discarding any with a relevance score below 0.1.
  • Knowledge graph: an optional graph layer (Section 3.5) returns related entities and relationships for incident and ticket records, and degrades gracefully when unavailable.
  • Synthesis: the configured serving model (Section 3.4; the offline benchmark of Section 4.3 used the primary managed-cloud tier) produces the final grounded answer with per-source citations.
  • Indexing: Qdrant uses cosine similarity over the dense vectors; chunking, metadata filters, and embedding input-type (query vs. document) settings are configurable and follow the per-source ingestion pipeline.
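The fusion and reranking-cutoff steps of this pipeline can be illustrated with plain Python. The sketch below implements standard Reciprocal Rank Fusion with k = 60 and a 50-candidate cap per ranked list, then applies the top-N and minimum-score filter; the document identifiers and reranker scores are made-up inputs, and the managed reranker itself is not modeled.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60, limit: int = 50) -> list[str]:
    """Reciprocal Rank Fusion: score(d) = sum over lists of 1 / (k + rank(d))."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        # Ranks are 1-based; only the first `limit` candidates of each list count.
        for rank, doc_id in enumerate(ranking[:limit], start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending fused score, breaking ties by identifier for stability.
    return sorted(scores, key=lambda d: (-scores[d], d))


def select_top(reranked: list[tuple[str, float]], top_n: int = 5,
               min_score: float = 0.1) -> list[tuple[str, float]]:
    """Keep at most top_n reranked candidates with score >= min_score."""
    kept = [(doc, score) for doc, score in reranked if score >= min_score]
    kept.sort(key=lambda item: -item[1])
    return kept[:top_n]


# Hypothetical dense and sparse result lists for one query.
dense = ["a", "b", "c"]
sparse = ["b", "d", "a"]
fused = rrf_fuse([dense, sparse])
```

In this toy input, "b" ranks second and first in the two lists and "a" ranks first and third, so "b" receives the higher fused score even though neither list alone agrees on the order.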

Appendix A.5. Digital-Twin Validation (SPF/ECMP Pre-Check)

Before any link-metric change is committed (Section 3.7.7), the candidate change is validated against the NetworkX digital twin. The pre-check rejects any change that would isolate a node or remove the only remaining backup path. The procedure is:
  • Build twin. Construct a graph G = (V, E) from current topology state, with each edge weighted by its live IGP metric; mark nodes carrying an overload (do-not-transit) indicator.
  • Baseline. Compute SPF/ECMP shortest paths over G for the affected source–target demands and record, for each demand, the set of viable next-hops (working and backup paths).
  • Apply candidate. Form G′ from G by raising the impacted link’s metric to the proposed value (e.g., a large static cost), modeling the intended de-preference of the flapping link.
  • Recompute. Recompute SPF/ECMP over G′ for the same demands.
  • Isolation check. If any node reachable in the baseline becomes unreachable in G′ (i.e., the change isolates a node), reject.
  • Backup check. If, for any affected demand, no alternate viable path remains in G′ (the change removes the only backup, accounting for overload-marked nodes), reject.
  • Decision. Only if every demand retains a valid failover path and no node is isolated, return success; otherwise return reject and notify the operations channel that no safe change exists.
A configuration change is committed (as version-controlled YAML in Git) only on a success verdict; this enforces the guardrail constraint C of Section 3.1 that no change may isolate a node.
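The pre-check steps can be sketched as follows. To stay dependency-free the sketch uses a small Dijkstra routine from the standard library instead of NetworkX, and it rests on one interpretive assumption that is not spelled out above: a "viable alternate path" is taken to mean a path that avoids the impacted link entirely and does not transit an overload-marked node. Function and variable names are hypothetical.

```python
import heapq

Graph = dict[str, dict[str, int]]  # undirected adjacency map: node -> {neighbor: metric}


def spf(graph: Graph, source: str, overloaded=frozenset(), avoid=None) -> dict[str, int]:
    """Dijkstra distances from source. Overloaded nodes can be reached but
    not transited; `avoid` is an optional link (u, v) treated as absent."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale heap entry
        if u != source and u in overloaded:
            continue  # do-not-transit node: never expand its neighbors
        for v, metric in graph[u].items():
            if avoid is not None and {u, v} == set(avoid):
                continue
            if d + metric < dist.get(v, float("inf")):
                dist[v] = d + metric
                heapq.heappush(heap, (d + metric, v))
    return dist


def precheck(graph: Graph, link, new_metric: int, demands, overloaded=frozenset()) -> str:
    """Return "success" or "reject" for raising `link` to `new_metric`."""
    a, b = link
    # Apply candidate: copy the twin and raise the impacted link's metric.
    candidate = {u: dict(nbrs) for u, nbrs in graph.items()}
    candidate[a][b] = candidate[b][a] = new_metric
    for src, dst in demands:
        baseline = spf(graph, src, overloaded)
        after = spf(candidate, src, overloaded)
        # Isolation check. A finite metric cannot remove reachability, so this
        # cannot fire here; it is kept to mirror the listed step.
        if set(baseline) - set(after):
            return "reject"
        # Backup check (assumed meaning): a path must exist without the link.
        if dst not in spf(candidate, src, overloaded, avoid=link):
            return "reject"
    return "success"
```

On a three-node ring the change passes because traffic can still go the other way around; on a simple chain, or on a ring whose third node is overload-marked, the same change is rejected because no alternate path remains.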

Appendix A.6. Per-Event Timing-Extraction Procedure

The agentic per-event handling times (Table 4) are extracted from production observability traces. Each ChatOps interaction is recorded as a single root trace whose child spans cover the constituent steps (routing, retrieval, tool calls, model inference, synthesis). For each event the agentic handling time is the wall-clock span from the root-trace start (the intent arriving at the supervisor) to emission of the grounded response, τ_auto = t_response − t_intent, which includes tool-call and inference latency. Traces are grouped by use case using the selected-worker/tool labels recorded on each trace, and a representative per-event time is taken per group. The manual baseline τ_manual is measured separately as the time an experienced engineer requires for the equivalent procedure with the existing tools (Section 3.8.2); the reduction factor is τ_manual / τ_auto.
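A minimal sketch of this extraction is shown below. The trace records are synthetic, the field names (`use_case`, `t_intent`, `t_response`) are hypothetical, and the median is used as the representative per-group time purely as an assumption, since the choice of statistic is not specified above. The link-flap figures are chosen to echo the reported cycle of about 30 minutes manual versus about 3 minutes automated.

```python
from collections import defaultdict
from statistics import median


def handling_times(traces: list[dict]) -> dict[str, float]:
    """Group root traces by use case and return a representative tau_auto
    (here the median, an assumed choice) in seconds per use case."""
    groups: dict[str, list[float]] = defaultdict(list)
    for trace in traces:
        # tau_auto = t_response - t_intent (wall-clock seconds)
        groups[trace["use_case"]].append(trace["t_response"] - trace["t_intent"])
    return {use_case: median(values) for use_case, values in groups.items()}


def reduction_factors(tau_auto: dict[str, float],
                      tau_manual: dict[str, float]) -> dict[str, float]:
    """Reduction factor tau_manual / tau_auto for use cases present in both."""
    return {uc: tau_manual[uc] / tau_auto[uc] for uc in tau_auto if uc in tau_manual}


# Synthetic traces: timestamps in seconds since an arbitrary origin.
traces = [
    {"use_case": "link_flap", "t_intent": 0.0, "t_response": 170.0},
    {"use_case": "link_flap", "t_intent": 500.0, "t_response": 680.0},
    {"use_case": "link_flap", "t_intent": 900.0, "t_response": 1090.0},
]
tau_auto = handling_times(traces)
factors = reduction_factors(tau_auto, {"link_flap": 1800.0})  # 30 min manual
```

With these inputs the representative automated time is 180 s, giving a reduction factor of 10 for the link-flap case.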

Appendix A.7. Evaluation Rubric: User Feedback and Judge Scoring

Output quality is assessed with two complementary signals (Section 3.8.3). Explicit user feedback is a binary thumbs-up/down control on each posted response, captured through the chat interface and recorded via the feedback API as a score s ∈ {0, 1}; the reported positive-feedback rate is the mean of s over rated responses. Rating is voluntary, so the rated subset is a self-selected sample. LLM-as-a-judge scoring rates responses against a fixed rubric, each dimension on a normalized [0, 1] scale where 1 is best: helpfulness (does it address the intent with an actionable answer?); correctness (is it accurate with respect to retrieved/tool evidence?); hallucination, an inverse-groundedness signal (are all claims supported by the retrieved context and tool observations?); relevance (is the used context pertinent?); context relevance (are the retrieved passages on-topic for the query?); and conciseness (free of redundancy while complete?). Each trace also records latency and token usage. We apply this rubric in two settings: to production traffic, where the voluntary user-feedback signal is the representative measure (Section 4.2); and to a fixed offline benchmark of 200 operationally realistic questions answered over the live retrieval corpus and scored with these LLM-as-a-judge dimensions, reported as per-dimension means with 95% confidence intervals (Section 4.3). These intervals are normal-approximation intervals computed from the per-dimension means and standard deviations, and the per-dimension standard deviations are available alongside the means. The judge itself is an LLM configured within the observability platform’s evaluator framework. These automated judge scores are related in spirit to RAGAS-style RAG metrics [15] (faithfulness, answer relevance, context precision/recall); a human-labeled ground-truth comparison and a paired non-grounded baseline remain future work.
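The two summary statistics used here are simple to compute. The sketch below shows the positive-feedback rate as the mean of binary scores and a normal-approximation 95% interval of the form mean ± 1.96 · sd / √n. The standard deviation of 0.2 in the example is a made-up value for illustration (only the mean of 0.85 and n = 200 are reported figures), and the helper names are hypothetical.

```python
from math import sqrt
from statistics import mean


def positive_rate(scores: list[int]) -> float:
    """Positive-feedback rate: mean of binary scores s in {0, 1}."""
    if any(s not in (0, 1) for s in scores):
        raise ValueError("scores must be 0 or 1")
    return mean(scores)


def ci_from_summary(m: float, sd: float, n: int, z: float = 1.96) -> tuple[float, float]:
    """Normal-approximation confidence interval from a mean, a standard
    deviation, and a sample size: m +/- z * sd / sqrt(n)."""
    half_width = z * sd / sqrt(n)
    return m - half_width, m + half_width


rate = positive_rate([1, 1, 0, 1])
# Mean relevance 0.85 over 200 questions; sd = 0.2 is an assumed value.
low, high = ci_from_summary(0.85, 0.2, 200)
```

Because the half-width shrinks with √n, a 200-question benchmark yields fairly narrow intervals even for moderately dispersed judge scores; the interval says nothing about judge bias, which is why a human-labeled comparison is still needed.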

Appendix A.8. Illustrative Agent-Prompt Skeletons

The following skeletons are illustrative and sanitized: they convey the generic instruction structure of the supervisor and worker prompts without reproducing internal prompt text, system names, or operational content. They are intended only to make the routing and grounding behavior reproducible in spirit. Concrete domains, tool names, schemas, and policy text are operator-specific and are omitted.
Supervisor (router) prompt skeleton. The supervisor is constrained to structured output (Appendix A.2):
You are a routing supervisor for a network-operations
assistant. Given a user intent, select one or more
specialized workers from the fixed worker set and
provide arguments for each.
Return ONLY structured output with fields:
  rationale (why this routing), confidence in
  [0,1], and workers: [{name, args}].
If the intent is ambiguous or low-confidence, do not
guess: request a clarification or fall back to a
conservative default worker. Do not invent worker
names or tools outside the provided set.
Worker prompt skeleton. Each worker answers within its domain using only the tool layer and retrieved context:
You are a specialized worker for <domain>. Use only
the provided tools and the retrieved context to
answer; do not fabricate values, identifiers, or
state. Cite the retrieved sources and tool
observations that support your answer, and state
clearly when evidence is insufficient.
State-changing actions follow the configured
guardrails: general change-management writes are
not auto-submitted and require explicit human
approval, whereas the narrow pre-validated
self-healing action proceeds only after the
digital-twin pre-check (Appendix A.5)
succeeds, and always emits a notification with a
version-controlled, reversible configuration change.

References

  1. Peci, F.; Hamiti, E.; Khan, I. Agentic AI with ChatOps for Large-Scale Network Operations. In Proceedings of the 2025 IEEE Conference on Artificial Intelligence (CAI); IEEE, 2025; pp. 1617–1626. Conference version; this article is an extended version.
  2. Yao, S.; Zhao, J.; Yu, D.; Du, N.; Shafran, I.; Narasimhan, K.; Cao, Y. ReAct: Synergizing Reasoning and Acting in Language Models. In Proceedings of the 11th International Conference on Learning Representations (ICLR), 2023.
  3. Shinn, N.; Cassano, F.; Gopinath, A.; Narasimhan, K.; Yao, S. Reflexion: Language Agents with Verbal Reinforcement Learning. Proc. Adv. Neural Inf. Process. Syst. 2023, 36.
  4. Madaan, A.; Tandon, N.; Gupta, P.; Hallinan, S.; Gao, L.; Wiegreffe, S.; Alon, U.; Dziri, N.; Prabhumoye, S.; Yang, Y.; et al. Self-Refine: Iterative Refinement with Self-Feedback. Proc. Adv. Neural Inf. Process. Syst. 2023, 36.
  5. Schick, T.; Dwivedi-Yu, J.; Dessì, R.; Raileanu, R.; Lomeli, M.; Hambro, E.; Zettlemoyer, L.; Cancedda, N.; Scialom, T. Toolformer: Language Models Can Teach Themselves to Use Tools. Proc. Adv. Neural Inf. Process. Syst. 2023, 36.
  6. Patil, S.G.; Zhang, T.; Wang, X.; Gonzalez, J.E. Gorilla: Large Language Model Connected with Massive APIs. Proc. Adv. Neural Inf. Process. Syst. 2024, 37, 126544–126565.
  7. Yang, Z.; Li, L.; Wang, J.; Lin, K.; Azarnasab, E.; Ahmed, F.; Liu, Z.; Liu, C.; Zeng, M.; Wang, L. MM-ReAct: Prompting ChatGPT for Multimodal Reasoning and Action. arXiv 2023, arXiv:2303.11381.
  8. Huang, X.; Liu, W.; Chen, X.; Wang, X.; Wang, H.; Lian, D.; Wang, Y.; Tang, R.; Chen, E. Understanding the Planning of LLM Agents: A Survey. arXiv 2024, arXiv:2402.02716.
  9. Packer, C.; Wooders, S.; Lin, K.; Fan, V.; Patil, S.G.; Stoica, I.; Gonzalez, J.E. MemGPT: Towards LLMs as Operating Systems. arXiv 2023, arXiv:2310.08560.
  10. Wang, L.; Ma, C.; Feng, X.; Zhang, Z.; Yang, H.; Zhang, J.; Chen, Z.; Tang, J.; Chen, X.; Lin, Y.; et al. A Survey on Large Language Model Based Autonomous Agents. Front. Comput. Sci. 2024, 18, 186345.
  11. Wu, Q.; Bansal, G.; Zhang, J.; Wu, Y.; Li, B.; Zhu, E.; Jiang, L.; Zhang, X.; Zhang, S.; Liu, J.; et al. AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation. LLMAgents @ ICLR 2024 workshop oral, 2024.
  12. CrewAI. CrewAI: Framework for Orchestrating Role-Playing Autonomous AI Agents. 2024. Available online: https://www.crewai.com/ (accessed on 2025-05-01).
  13. LangChain. LangGraph: Building Stateful, Multi-Actor Applications with LLMs. 2026. Available online: https://docs.langchain.com/oss/python/langgraph/overview (accessed on 2026-06-01).
  14. Gao, Y.; Xiong, Y.; Gao, X.; Jia, K.; Pan, J.; Bi, Y.; Dai, Y.; Sun, J.; Wang, M.; Wang, H. Retrieval-Augmented Generation for Large Language Models: A Survey. arXiv 2023, arXiv:2312.10997.
  15. Es, S.; James, J.; Espinosa-Anke, L.; Schockaert, S. RAGAS: Automated Evaluation of Retrieval Augmented Generation. In Proceedings of the 18th Conference of the European Chapter of the Association for Computational Linguistics: System Demonstrations, 2024; pp. 150–158.
  16. Saad-Falcon, J.; Khattab, O.; Potts, C.; Zaharia, M. ARES: An Automated Evaluation Framework for Retrieval-Augmented Generation Systems. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2024; pp. 338–354.
  17. TruEra, Snowflake. TruLens: Evaluation and Tracking for LLM Experiments. 2024. Available online: https://www.trulens.org/ (accessed on 2025-05-01).
  18. Langfuse. Langfuse: Open-Source LLM Engineering Platform—Tracing, Evaluation, and Observability. 2024. Available online: https://langfuse.com/ (accessed on 2025-05-01).
  19. Yu, W.; Zhang, H.; Pan, X.; Cao, P.; Ma, K.; Li, J.; Wang, H.; Yu, D. Chain-of-Note: Enhancing Robustness in Retrieval-Augmented Language Models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, 2024; pp. 14672–14685.
  20. Huang, L.; Yu, W.; Ma, W.; Zhong, W.; Feng, Z.; Wang, H.; Chen, Q.; Peng, W.; Feng, X.; Qin, B.; et al. A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions. ACM Trans. Inf. Syst. 2025, 43, 42:1–42:55.
  21. Huang, Y.; Du, H.; Zhang, X.; Niyato, D.; Kang, J.; Xiong, Z.; Wang, S.; Huang, T. Large Language Models for Networking: Applications, Enabling Techniques, and Challenges. arXiv 2023, arXiv:2311.17474.
  22. Maatouk, A.; Piovesan, N.; Ayed, F.; De Domenico, A.; Debbah, M. Large Language Models for Telecom: Forthcoming Impact on the Industry. IEEE Commun. Mag. 2025, 63, 62–68.
  23. Wu, D.; Wang, X.; Qiao, Y.; Wang, Z.; Jiang, J.; Cui, S.; Wang, F. NetLLM: Adapting Large Language Models for Networking. In Proceedings of the ACM SIGCOMM 2024 Conference, 2024.
  24. Wang, J.; Zhang, L.; Yang, Y.; Zhuang, Z.; Qi, Q.; Sun, H.; Lu, L.; Feng, J.; Liao, J. Network Meets ChatGPT: Intent Autonomous Management, Control and Operation. J. Commun. Inf. Netw. 2023, 8, 239–255.
  25. Bandlamudi, J.; Mukherjee, K.; Agarwal, P.; Dechu, S.; Huo, S.; Isahagian, V.; Muthusamy, V.; Purushothaman, N.; Sindhgatta, R. Towards Hybrid Automation by Bootstrapping Conversational Interfaces for IT Operation Tasks. Proc. AAAI Conf. Artif. Intell. 2023, 37, 15654–15660.
  26. Wulf, J.; Meierhofer, J. Exploring the Potential of Large Language Models for Automation in Technical Customer Service. In Digital Service Innovation: Redefining Provider-Customer Interactions, Proceedings of the Spring Servitization Conference, 2024; pp. 146–157.
  27. Wang, Z.; Liu, Z.; Zhang, Y.; Zhong, A.; Wang, J.; Yin, F.; Fan, L.; Wu, L.; Wen, Q. RCAgent: Cloud Root Cause Analysis by Autonomous Agents with Tool-Augmented Large Language Models. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management (CIKM), 2024; pp. 4966–4974.
  28. Zhang, D.; Zhang, X.; Bansal, C.; Las-Casas, P.; Fonseca, R.; Rajmohan, S. PACE-LM: Prompting and Augmentation for Calibrated Confidence Estimation with GPT-4 in Cloud Incident Root Cause Analysis. arXiv 2023, arXiv:2309.05833.
  29. An, K.; Yang, F.; Lu, J.; Li, L.; Ren, Z.; Huang, H.; Wang, L.; Zhao, P.; Kang, Y.; Ding, H.; et al. Nissist: An Incident Mitigation Copilot Based on Troubleshooting Guides. In Proceedings of ECAI 2024: 27th European Conference on Artificial Intelligence, 2024.
  30. Jiang, Y.; Zhang, C.; He, S.; Yang, Z.; Ma, M.; Qin, S.; Kang, Y.; Dang, Y.; Rajmohan, S.; Lin, Q.; et al. Xpert: Empowering Incident Management with Query Recommendations via Large Language Models. In Proceedings of the 46th IEEE/ACM International Conference on Software Engineering (ICSE), 2024.
  31. Shetty, M.; Bansal, C.; Upadhyayula, S.P.; Radhakrishna, A.; Gupta, A. AutoTSG: Learning and Synthesis for Incident Troubleshooting. In Proceedings of the 30th ACM Joint European Software Engineering Conference and Symposium on the Foundations of Software Engineering (ESEC/FSE), 2022.
  32. Roy, D.; Zhang, X.; Bhave, R.; Bansal, C.; Las-Casas, P.; Fonseca, R.; Rajmohan, S. Exploring LLM-Based Agents for Root Cause Analysis. In Companion Proceedings of the 32nd ACM International Conference on the Foundations of Software Engineering (FSE Companion), 2024; pp. 208–219.
  33. Ferrag, M.A.; Battah, A.; Tihanyi, N.; Jain, R.; Maimuţ, D.; Alwahedi, F.; Lestable, T.; Thandi, N.S.; Mechri, A.; Debbah, M.; et al. SecureFalcon: Are We There Yet in Automated Software Vulnerability Detection with LLMs? IEEE Trans. Softw. Eng. 2025, 51, 1248–1265.
  34. Jiang, A.Q.; Sablayrolles, A.; Mensch, A.; Bamford, C.; Chaplot, D.S.; de las Casas, D.; Bressand, F.; Lengyel, G.; Lample, G.; Saulnier, L.; et al. Mistral 7B. arXiv 2023, arXiv:2310.06825.
  35. Mistral AI. Mistral-Small-24B-Instruct-2501 Model Card. 2025. Available online: https://huggingface.co/mistralai/Mistral-Small-24B-Instruct-2501 (accessed on 2026-06-01).
  36. Touvron, H.; Martin, L.; Stone, K.; et al. Llama 2: Open Foundation and Fine-Tuned Chat Models. arXiv 2023, arXiv:2307.09288.
  37. Amazon Web Services. Amazon Bedrock — Anthropic Claude models. 2026. Available online: https://docs.aws.amazon.com/bedrock/latest/userguide/model-cards-anthropic.html (accessed on 2026-06-01).
  38. NetBox Labs. NetBox: The Premier Network Source of Truth. 2024. Available online: https://netboxlabs.com/docs/netbox/ (accessed on 2026-06-01).
  39. Ciena. Ciena Route Explorer. 2026. Available online: https://www.ciena.com/insights/data-sheets/Route-Explorer.html (accessed on 2026-06-01).
  40. Selector AI. Selector AI Network Observability Platform. 2026. Available online: https://www.selector.ai/ (accessed on 2026-06-01).
  41. Case, J.; Fedor, M.; Schoffstall, M.; Davin, J. A Simple Network Management Protocol (SNMP); IETF, 1990.
  42. Scudder, J.; Fernando, R.; Stuart, S. BGP Monitoring Protocol (BMP); IETF, 2016.
  43. Formal, T.; Piwowarski, B.; Clinchant, S. SPLADE: Sparse Lexical and Expansion Model for First Stage Ranking. In Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2021; pp. 2288–2292.
  44. Cormack, G.V.; Clarke, C.L.A.; Büttcher, S. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR), 2009; pp. 758–759.
  45. Amazon Web Services. Amazon Bedrock — Cohere Embed Multilingual. 2026. Available online: https://docs.aws.amazon.com/bedrock/latest/userguide/model-card-cohere-embed-multilingual.html (accessed on 2026-06-01).
  46. Amazon Web Services. Amazon Bedrock — Cohere Embed v3 Model Parameters. 2026. Available online: https://docs.aws.amazon.com/bedrock/latest/userguide/model-parameters-embed-v3.html (accessed on 2026-06-01).
  47. Amazon Web Services. Amazon Bedrock — Supported Regions and Models for Reranking. 2026. Available online: https://docs.aws.amazon.com/bedrock/latest/userguide/rerank-supported.html (accessed on 2026-06-01).
Figure 1. Reference architecture of the agentic ChatOps system. A ChatOps front end (Slack/Teams) connects over a WebSocket to the Bot—a supervisor-orchestrated multi-agent system—whose intelligent query router selects a locally hosted open-weight model or a managed cloud model by data-sensitivity class (Section 3.4). A vector database backs retrieval, secrets are held in a dedicated vault, and the specialized agents reach the heterogeneous operational tool estate through a uniform (MCP/REST) interface.
Figure 2. Supervisor-orchestrated multi-agent architecture (current deployment). After a context-summarization step, an LLM supervisor routes each intent—with a routing rationale, a confidence score, sticky routing, and fallback chains—to one or more of fourteen specialized domain workers (dispatched in parallel for multi-domain queries; each runs the control loop of Figure 3 with its own tool set), and a synthesis step merges their results into a response. Dedicated agents complement the routed workers: pattern-matched expert agents (change validation, configuration audit) and autonomous fast-path agents (self-healing, network troubleshooting) that act and terminate directly, with an incident-management policy gate. The full capability set is listed in Table 3.
Figure 3. Control loop executed within each specialized worker. Intent is decomposed into a plan; tools are invoked through function calling; observations are reflected upon and, if needed, trigger replanning; guardrails gate state-mutating actions and log every step. Memory and retrieval supply short- and long-term context.
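The control loop of Figure 3 can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not the deployed implementation: the function `run_worker`, its parameters, the tool names, and the `reflect` and `approve` callbacks are all hypothetical, and planning and reflection would be performed by the LLM rather than by fixed functions.

```python
def run_worker(plan, tools, mutating, approve, reflect, max_replans=2):
    """Run plan steps, gating state-mutating tools and logging every step."""
    log, queue, replans = [], list(plan), 0
    while queue:
        name, arg = queue.pop(0)
        # Guardrail: a state-mutating action runs only with explicit approval.
        if name in mutating and not approve(name, arg):
            log.append((name, arg, "blocked by guardrail"))
            continue
        observation = tools[name](arg)
        log.append((name, arg, observation))
        # Reflection: the observation may yield extra steps (a bounded replan).
        new_steps = reflect(name, arg, observation)
        if new_steps and replans < max_replans:
            replans += 1
            queue = list(new_steps) + queue
    return log


# Hypothetical tools returning canned observations.
tools = {
    "get_path": lambda link: "path unavailable",
    "get_lsdb": lambda link: "adjacency up",
    "set_metric": lambda link: "committed",
}


def reflect(name, arg, observation):
    # If the path lookup fails, replan by consulting the link-state database.
    return [("get_lsdb", arg)] if observation == "path unavailable" else []


log = run_worker(
    plan=[("get_path", "link-1"), ("set_metric", "link-1")],
    tools=tools,
    mutating={"set_metric"},
    approve=lambda name, arg: False,  # deny all mutations in this example
    reflect=reflect,
)
```

In this run the failed path lookup triggers one replanned step, and the metric change is logged as blocked because the guardrail callback denies it.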
Figure 4. Current call flow for the four information-retrieval use cases in Section 3.7. Each request passes through context summarization and supervisor routing, is handled by the appropriate domain worker, invokes typed MCP/REST tools, assembles tool observations with provenance into an evidence package, and is synthesized into a grounded response.
Figure 5. Node-isolation detection workflow during maintenance. The current fast-path agent consumes AIOps and route-explorer signals, calls LSDB and service-impact tools through the MCP/REST layer, correlates any isolation against the planned maintenance context, and invokes the incident-management gate before paging and notifying service engineers.
Figure 6. Vendor knowledge-base search workflow. The supervisor routes vendor-related incidents to a vendor worker, which collects logs and metadata, identifies the vendor/platform, retrieves relevant knowledge-base and internal case evidence, optionally updates support/RMA workflows, and returns synthesized findings and next actions.
Figure 7. Closed-loop backbone link-flapping remediation. Data-plane (SNMP) and control-plane (BMP/BGP+IGP) events are fused into a unified LinkEntity; on a threshold breach the self-healing autonomous agent confirms context, validates a candidate metric change in a digital twin, passes the incident-management policy gate, and only then invokes the configuration-management orchestrator to commit a Git-versioned change and notify the operations channel.
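The ordering of checks in Figure 7 (threshold breach, digital-twin validation, policy gate, commit) can be made concrete with a short Python sketch. The class `FlapDetector`, the function `remediate`, the 300-second window, and the threshold of five transitions are assumptions chosen for illustration; the caption does not state the actual detection parameters.

```python
from collections import deque


class FlapDetector:
    """Count link state transitions per link within a sliding time window."""

    def __init__(self, window_s=300.0, threshold=5):
        self.window_s = window_s    # assumed window length in seconds
        self.threshold = threshold  # assumed number of transitions that counts as flapping
        self.events = {}

    def observe(self, link_id, timestamp):
        """Record one transition; return True when the threshold is breached."""
        q = self.events.setdefault(link_id, deque())
        q.append(timestamp)
        # Drop transitions that have aged out of the window.
        while q and timestamp - q[0] > self.window_s:
            q.popleft()
        return len(q) >= self.threshold


def remediate(link_id, twin_ok, gate_ok, commit):
    """Commit a change only if twin validation and the policy gate both pass."""
    if not twin_ok(link_id):
        return "rejected: twin validation failed"
    if not gate_ok(link_id):
        return "rejected: policy gate"
    commit(link_id)
    return "committed"
```

The point of the sketch is the ordering: `commit` is unreachable unless both earlier checks return true, which mirrors the "only then" wording of the caption.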
Figure 8. ChatOps multi-tool aggregator. On observing a UUID-tagged escalation thread, the supervisor dispatches a customer-scoped parallel query across service, incident, network-infrastructure, and knowledge-base workers, merges the returned evidence with provenance, synthesizes the result with the LLM, posts the consolidated picture to the thread, and logs the trace.
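The parallel dispatch and provenance-preserving merge of Figure 8 can be sketched with Python's `asyncio`. The function names, the customer identifier, and the shape of the evidence records are hypothetical; a real worker would call MCP/REST tools rather than return canned data.

```python
import asyncio


async def query_worker(name, customer_id):
    """Stand-in for a domain worker; returns evidence tagged with its source."""
    await asyncio.sleep(0)  # yield control, standing in for network I/O
    return {"source": name, "customer": customer_id, "evidence": f"{name} data"}


async def aggregate(customer_id, workers):
    """Query all workers concurrently and keep provenance for each result."""
    results = await asyncio.gather(
        *(query_worker(w, customer_id) for w in workers),
        return_exceptions=True,  # one failing worker should not sink the rest
    )
    # Keep only successful results; each carries its own "source" field.
    return [r for r in results if not isinstance(r, Exception)]


WORKERS = ["service", "incident", "network_infrastructure", "knowledge_base"]
evidence = asyncio.run(aggregate("customer-123", WORKERS))
```

`asyncio.gather` returns results in the order the workers were listed, so the merged evidence stays aligned with its provenance before it is passed to synthesis.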
Figure 9. Per-event handling time, manual versus agentic, across the six recurring use cases (logarithmic scale). Bold annotations give the reduction factor. The agentic approach reduces per-task time by one to three orders of magnitude.
Figure 10. Representative London–Singapore layer-2 service. Redundant spine pairs (parallel) connect each metro’s service edge to its backbone; the backbones are joined in series via a third-party carrier.
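The series/parallel structure of Figure 10 maps onto the standard reliability formulas: availabilities multiply in series, and unavailabilities multiply in parallel. The Python sketch below applies them to the topology in the caption. The per-component availabilities are placeholder values, not measurements from the deployment, the service edges are omitted for brevity, and component failures are assumed to be independent.

```python
def series(*availabilities):
    """All components must work: multiply the availabilities."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel(*availabilities):
    """At least one component must work: 1 minus the product of unavailabilities."""
    unavailable = 1.0
    for a in availabilities:
        unavailable *= 1.0 - a
    return 1.0 - unavailable


# Placeholder per-component availabilities (illustrative, not measured).
SPINE, BACKBONE, CARRIER = 0.999, 0.9995, 0.999

# Redundant spine pair in each metro, then backbone, carrier, backbone in series.
spine_pair = parallel(SPINE, SPINE)
service = series(spine_pair, BACKBONE, CARRIER, BACKBONE, spine_pair)
```

With any such values the end-to-end figure is bounded above by the weakest series element, here the third-party carrier, while the redundant spine pairs contribute comparatively little unavailability.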
Table 1. Positioning of this work relative to representative LLM-for-operations systems. ✓: present/supported; ∼: partial/prototype; ✗: not addressed.
System | Primary domain | Multi-tool | Local LLM | RAG | Prod. eval. | Closed-loop
RCAgent [27] Cloud RCA
Nissist [29] Incident mitigation
Xpert [30] Incident query rec.
NetLLM [23] Networking (multi-task)
Network-ChatGPT [24] Intent mgmt.
AutoTSG [31] Troubleshooting synth.
This work Carrier network ops a b
Table 2. Evolution of the LLM deployment, from the initial (conference) configuration to the current production configuration.
| Deployment | Models |
|---|---|
| Initial (conference [1]) | Open-weight instruction-tuned model served locally for confidential requests [34,35], with a cloud model used only for non-sensitive inference |
| Current (managed cloud) | Managed Anthropic model tiers served through Amazon Bedrock: a lower-latency tier for high-volume interactions and a higher-capability tier for complex reasoning, within the operator’s cloud tenancy |
Table 3. Deployed operational capability set. A supervisor routes to fourteen specialized domain workers, complemented by dedicated expert and autonomous agents (generic functional descriptions; internal system names omitted). The seven use cases detailed in Section 3.7 are instances within this set.
| Worker / agent | Operational function |
|---|---|
| Supervisor-routed domain workers (14) | |
| General troubleshooting assistant | Broad Q&A and diagnostics with access to the full tool set |
| Inventory & infrastructure | Device, circuit, and spare-location lookup over the source of truth |
| Service & connection lookup | Customer-facing interconnection services and connection state |
| Network infrastructure | Control-plane topology, path tracing, and traffic-engineering metrics |
| IP / DNS | IP address management and DNS operations |
| Change requests | Change-request creation, scheduling, and metadata retrieval |
| Service & incident management | Incident/ticket creation, ITSM and ticketing workflow, case lookups |
| Vendor support | Vendor technical-support (TAC) case management |
| Vendor insights | Vendor product and knowledge insights |
| RMA tracking | Return-merchandise-authorization case operations and status |
| Shipment / logistics | Hardware shipment and logistics tracking |
| Device-access management | Out-of-band/console device-access information |
| Service-inventory posting | Service-inventory record creation and updates |
| Utilities & knowledge base | Knowledge-base retrieval, operational CLIs, and notifications |
| Expert agents (pattern-matched) | |
| Change validation | Pre-/post-change validation against a baseline; validation reports |
| Configuration audit | Configuration and compliance auditing |
| Autonomous fast-path agents | |
| Self-healing | Detection-triggered closed-loop remediation (e.g., link-flap cost-out) |
| Network troubleshooting / RCA | Root-cause analysis, device-stability/flap detection, telemetry correlation |
| Incident-management gate | Cross-agent policy gate invoked by the autonomous agents |
Table 4. Per-event operational efficiency by use case. Manual and agentic figures are representative per-event operational timings; the reduction factor is their ratio ($\tau_{\text{manual}} / \tau_{\text{auto}}$). Production task frequencies and volumes, measured from Langfuse traces, are reported separately in Section 4.2.
| Task | Manual (s/event) | Agentic (s/event) | Reduction factor |
|---|---|---|---|
| Spare locator | 900 | 10 | 90× |
| Console information | 30 | 3 | 10× |
| Packet loss / latency | 3,600 | 15 | 240× |
| Path retrieval | 3,600 | 5 | 720× |
| Node isolation | 7,200 | 3 | 2,400× |
| Vendor search | 10,800 | 30 | 360× |
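The reduction factors in Table 4 follow directly from the two timing columns. As a quick arithmetic check, the Python snippet below recomputes them from the tabulated values; the variable and function names are illustrative only.

```python
# (manual s/event, agentic s/event) as listed in Table 4
TIMINGS = {
    "Spare locator": (900, 10),
    "Console information": (30, 3),
    "Packet loss / latency": (3600, 15),
    "Path retrieval": (3600, 5),
    "Node isolation": (7200, 3),
    "Vendor search": (10800, 30),
}


def reduction_factor(manual_s, agentic_s):
    """Ratio of manual to agentic per-event handling time."""
    return manual_s / agentic_s


factors = {task: reduction_factor(m, a) for task, (m, a) in TIMINGS.items()}
```

The recomputed ratios span 10× (console information) to 2,400× (node isolation), matching the range quoted for the six recurring tasks.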
Table 5. Production-deployment measurements over a representative 90-day window, traced with Langfuse. User feedback is a voluntary thumbs-up/down rating recorded via the feedback API (168 of ∼7,400 interactions rated, a ∼2.3% response rate, and a self-selected subset); model-token spend is measured as prompt and completion tokens per trace. These are observed production signals, not validated against ground truth.
| Measurement | Value |
|---|---|
| Interactions traced (90 days) | ∼7,400 |
| Interaction channel | |
|    via ChatOps messaging channel | ∼91% |
|    other (automated tool invocations) | ∼9% |
| Task class (partitions the ∼7,400 traces) | |
|    automated self-healing/monitoring | ∼58% (∼4,300) |
|    engineer-initiated queries | ∼42% (∼3,100) |
| Rated responses (user feedback) | 168 |
|    positive / negative | 104 / 64 |
| Positive share of rated responses | 61.9% (104/168) |
|    95% CI (normal approx.) | [54.5%, 69.2%] |
|    response rate | ∼2.3% (168/∼7,400) |
| Model-token spend (Bedrock) | Prompt and completion tokens recorded per trace |
| Primary / reasoning serving tiers | Managed model tiers on Amazon Bedrock |
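The normal-approximation interval for the positive-feedback share in Table 5 can be recomputed from the two counts. The sketch below uses the usual Wald formula $p \pm z\sqrt{p(1-p)/n}$ with $z = 1.96$; the function name is illustrative, and the exact $z$ value and rounding used for the table are assumptions.

```python
import math


def proportion_ci(successes, n, z=1.96):
    """Wald (normal-approximation) confidence interval for a proportion."""
    p = successes / n
    half_width = z * math.sqrt(p * (1 - p) / n)
    return p, p - half_width, p + half_width


# 104 positive ratings out of 168 rated responses (Table 5)
p, lo, hi = proportion_ci(104, 168)
```

Under these assumptions the point estimate is about 61.9% and the bounds agree with the tabulated interval to within roughly 0.1 percentage point. Because the rated subset is self-selected, the interval describes sampling variability only, not selection bias.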
Table 6. Representative incident-workload stage counts from the upstream AIOps event-correlation platform [40], measured over a representative 30-day window. Each value is an independently measured platform count rather than a stage of a strict nested funnel: the distilled categories (high-impact alerts, actionable events, incidents created) are distinct measurements, so actionable events is not a subset of incidents created and the stages need not be monotonically ordered. The combined raw signal volume exceeds incidents created by roughly 600:1. These auto-generated incidents are the dominant workload triaged by the agentic ChatOps system.
| Stage (representative 30-day counts) | Count |
|---|---|
| Raw signals (each stream measured independently) | |
|    anomalies | ∼1,919,632 |
|    monitoring-system alerts | ∼715,848 |
|    combined raw signals | ∼2.64 M |
| Distilled categories (each measured independently) | |
|    high-impact alerts | 7,885 |
|    actionable events | 645 |
|    incidents created | 4,382 (∼146/day) |
Table 7. Offline RAG-quality benchmark over a fixed set of N = 200 operationally realistic questions, answered by the production primary managed-cloud model tier over the live retrieval corpus and scored by Langfuse LLM-as-a-judge evaluators. Each score is in [0, 1]; for the groundedness / no-hallucination score, a higher value denotes less hallucination. Means are reported with the sample standard deviation (SD) and a normal-approximation 95% confidence interval over the 200 questions.
| Judge dimension | Mean | SD | 95% CI |
|---|---|---|---|
| Relevance | 0.847 | 0.157 | [0.825, 0.869] |
| Context relevance | 0.788 | 0.228 | [0.756, 0.820] |
| Correctness | 0.739 | 0.236 | [0.707, 0.772] |
| Groundedness / no-hallucination score | 0.707 | 0.302 | [0.665, 0.749] |
| Helpfulness | 0.653 | 0.198 | [0.625, 0.680] |
| Conciseness | 0.332 | 0.071 | [0.322, 0.342] |
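The intervals in Table 7 are consistent with the normal-approximation formula for a mean, $\bar{x} \pm z \cdot \mathrm{SD}/\sqrt{N}$, with $N = 200$. The Python sketch below recomputes the interval for the relevance row from the rounded table values; the function name and $z = 1.96$ are assumptions, and small last-digit differences on other rows are expected because the table reports rounded means and SDs.

```python
import math


def mean_ci(mean, sd, n, z=1.96):
    """Normal-approximation confidence interval for a sample mean."""
    half_width = z * sd / math.sqrt(n)
    return mean - half_width, mean + half_width


# Relevance row of Table 7: mean 0.847, SD 0.157, N = 200
lo, hi = mean_ci(0.847, 0.157, 200)
```

The half-width scales with SD, which is why the groundedness score (SD 0.302) has the widest interval and conciseness (SD 0.071) the narrowest.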
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.