Preprint
Article

This version is not peer-reviewed.

FedMARL-LTI: Federated Multi-Agent Reinforcement Learning with LLM-Driven Threat Intelligence for Cooperative Cyber Defense

Submitted: 02 July 2026

Posted: 03 July 2026

Abstract
Cross-organization cyber defense must reconcile collaborative learning with privacy and adversarial robustness, yet standard federated learning ships full gradient tensors, leaking sensitive security posture and inviting Byzantine manipulation. We present FedMARL-LTI, a federated multi-agent reinforcement learning framework whose architecture answers both pressures with a single decision: organizations share neither raw data nor model weights, only differentially private 768-dimensional semantic threat embeddings. The contribution is fourfold. (1) Semantic Abstraction (SA) channel: in each round, every organization's local gradient is summarized by an LLM, projected to a 768-dimensional embedding, L2-clipped, and perturbed with Gaussian noise before any numeric quantity leaves the host. The bottleneck reduces the per-element noise scale from $O(\sqrt{d_{\text{model}}})$ to $O(\sqrt{m})$, with $m = 768 \ll d_{\text{model}} \approx 3 \times 10^{5}$. (2) Formal privacy analysis: the SA+DP cascade satisfies $(\varepsilon, \delta)$-DP and bounds per-round mutual-information leakage by $\min\{T_{\text{tok}} \log_2 V,\ \tfrac{m}{2} \log_2(1 + C^2/(m\sigma^2))\}$ (Theorem 1), with Rényi composition over $T$ federation rounds (Theorem 2). (3) A Byzantine-resilient ClippedClustering aggregator that combines L2 clipping with cosine-similarity clustering. (4) A hierarchical MARL policy with threat-profile-aware LLM-IRR reward shaping, wired end-to-end (we disclose that the LLM call is currently stubbed with a deterministic projection for reproducibility). We evaluate on CybORG CAGE-4 with n=5 organizations and 30 federation rounds × 5 episodes × 100 steps per round. The SA channel adds no statistically significant utility cost relative to the no-privacy baseline: SA-only Δreward = -0.66 (t=+0.31, NS), dual SA + Weight-DP Δreward = +1.90 (t=-0.71, NS), all N=5 seeds, all |t| < 1.3. A controlled signal/noise probe confirms a 19.58× improvement of SA over Weight-DP at a fixed DP budget, matching the predicted $\sqrt{d_{\text{model}}/m} \approx 19.8$.
Under a Byzantine sign_flip attack at 30% (N=15), ClippedClustering is directionally strongest (F1=0.025 vs. FedAvg 0.020, Krum 0.016), but the edge is not statistically significant (CC vs. Krum t=+1.59, p=0.15, d=+0.58; the earlier N=5 “3.4×” gap was small-sample optimism, §5.2). Its decisive Byzantine win comes under the harsher random_noise attack, where FedAvg diverges to NaN and Krum collapses to 0.002 while ClippedClustering survives at 0.020 (§5.7, Cohen’s d=+3.77). The cooperative-PPO family (MAPPO, IPPO) outperforms the value-based and actor-critic methods (QMIX, MADDPG) by ≈20 reward units, p < 0.001. All host-level F1 values stay below 0.05 at the 15K-step training horizon used here; the paper's relative claims (zero-cost privacy, ClippedClustering’s decisive Byzantine win on the harshest attacks per §5.7, cooperative-PPO dominance) are unaffected by this scope. A 200K-step long-horizon replication (§6.3 L1) lifts F1 above the 15K plateau (to ≈0.044, N=5), confirming that horizon, not the privacy/Byzantine machinery, gates absolute accuracy. However, a finer 60-checkpoint run shows that the climb is volatile and non-monotonic and does not reach deployment-grade accuracy, a limitation of training stability rather than of compute. We release all 141 raw run JSON outputs (Phases 1–3, the L4 backend comparison, and the algorithm/aggregator baselines), the figures, and the analysis scripts for replication.
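The clip-then-noise step of the SA channel and the predicted √(d_model/m) gain can be illustrated with a short sketch. This is not the authors' implementation: the function names, the clipping bound of 1.0, and the noise multiplier of 1.0 are assumptions chosen for illustration, and the comparison simply measures how the L2 norm of isotropic Gaussian noise grows with dimension when the clipping bound and per-coordinate noise scale are held fixed. Only m = 768 and d_model ≈ 3×10^5 are taken from the abstract.

```python
import math
import random


def l2_clip(vec, clip_norm):
    """Scale vec down so its L2 norm is at most clip_norm (no-op if already within)."""
    norm = math.sqrt(sum(x * x for x in vec))
    scale = min(1.0, clip_norm / norm) if norm > 0 else 1.0
    return [x * scale for x in vec]


def gaussian_release(vec, clip_norm, noise_multiplier, rng):
    """Clip vec, then add independent N(0, sigma^2) noise per coordinate.

    sigma = noise_multiplier * clip_norm is the usual Gaussian-mechanism
    parameterization; both parameters are illustrative assumptions here.
    """
    sigma = noise_multiplier * clip_norm
    return [x + rng.gauss(0.0, sigma) for x in l2_clip(vec, clip_norm)]


def noise_norm(dim, sigma, rng):
    """L2 norm of one draw of dim-dimensional isotropic Gaussian noise."""
    return math.sqrt(sum(rng.gauss(0.0, sigma) ** 2 for _ in range(dim)))


def predicted_gain(d_model, m):
    """Predicted signal/noise gain of an m-dim release over a d_model-dim one."""
    return math.sqrt(d_model / m)


rng = random.Random(0)
m, d_model = 768, 300_000          # dimensions quoted in the abstract
clip_norm, noise_multiplier = 1.0, 1.0  # hypothetical values

# One noised 768-dim release, as it would leave the host in this sketch.
embedding = [rng.gauss(0.0, 1.0) for _ in range(m)]
released = gaussian_release(embedding, clip_norm, noise_multiplier, rng)

# With the same clipping bound and sigma, the noise norm grows like sqrt(dim),
# so the signal/noise ratio of the two channels differs by about sqrt(d_model/m).
sigma = noise_multiplier * clip_norm
empirical = noise_norm(d_model, sigma, rng) / noise_norm(m, sigma, rng)
gain = predicted_gain(d_model, m)
print(f"predicted gain: {gain:.2f}, single-draw empirical ratio: {empirical:.2f}")
```

A single draw only approximates the prediction; averaging over several draws tightens the match. The 19.58× figure reported in the abstract comes from the authors' own probe, not from this sketch.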
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
Preprints.org is a free preprint server supported by MDPI in Basel, Switzerland.

© 2026 MDPI (Basel, Switzerland) unless otherwise stated
