Preprint
Article

This version is not peer-reviewed.

Time-Aware Security Intelligence for Federated Financial Systems: Deep Reinforcement Learning Against Temporal Poisoning Attacks

Submitted: 01 October 2025

Posted: 02 October 2025


Abstract
Financial institutions operating distributed machine learning systems face an emerging class of stealthy adversaries who exploit temporal patterns across training cycles to inject persistent backdoors that can remain dormant for months before activation. Unlike conventional single-round attacks, these temporal poisoning strategies leverage sequential dependencies to bypass existing detection mechanisms while gradually compromising model integrity. Existing defense frameworks largely assume static threat models and do not address multi-period adversarial strategies that unfold over time in financial transaction streams, a serious gap in high-stakes financial environments where small perturbations can trigger systemic failures. To address these challenges, we propose DEFEND, a defense framework that integrates temporal behavior analysis, robust statistical aggregation, and multi-scale verification into a unified multi-layer architecture. Our framework formulates defense coordination as a Markov Decision Process and employs Proximal Policy Optimization for adaptive policy learning that dynamically balances security enforcement with model utility. We design three temporal attack models to evaluate our defense mechanism: fixed-period data poisoning, multi-period data poisoning, and model weight poisoning. The multi-layer defense architecture combines geometric median-based robust aggregation with Dynamic Time Warping pattern matching and adaptive client participation control.
Extensive experiments on CIFAR-10, FEMNIST, and MNIST datasets demonstrate that DEFEND achieves superior defense performance with success rates of 95.6% for ResNet-18 and 94.0% for MobileNet V2, while maintaining clean accuracy levels between 85-95% across various data heterogeneity levels and malicious client ratios. Our framework provides theoretical guarantees for Byzantine robustness and practical scalability for moderate-scale federated deployments, making it well-suited for real-world financial applications requiring both security and efficiency.

1. Introduction

1.1. Background

The financial industry is undergoing rapid digital transformation, with institutions increasingly relying on distributed machine learning techniques to improve fraud detection, risk management, and regulatory compliance. Federated learning (FL) has emerged as a promising paradigm to enable collaborative intelligence across banks, payment platforms, and other financial entities while maintaining data privacy [1,2,3]. By allowing participants to train shared models without centralizing raw data, FL provides a natural fit for sensitive financial applications where strict privacy regulations and competitive concerns prevent data sharing.
However, the adoption of FL in financial systems introduces significant security challenges. Recent studies highlight that FL is inherently vulnerable to poisoning and backdoor attacks, where malicious participants inject crafted updates to manipulate global models [4,5]. These attacks can be particularly damaging in financial environments, where small perturbations may lead to large-scale fraud or systemic risks. Beyond traditional single-round threats, researchers have uncovered persistent backdoor strategies that exploit temporal dependencies across multiple training rounds, allowing attackers to evade conventional defenses and achieve long-term stealth [6,7]. Such temporal vulnerabilities are especially critical in financial transaction streams, which naturally exhibit sequential correlations and evolving patterns.
To address these challenges, researchers have proposed robust aggregation methods and privacy-enhancing techniques to mitigate adversarial updates in FL. Approaches such as robust learning rate adjustment [8,9,10] and privacy-preserving backdoor defenses for heterogeneous data distributions [11] have demonstrated effectiveness under certain threat models. Nevertheless, most of these methods remain static and fail to capture multi-period adversarial strategies that unfold over time. Moreover, existing solutions often trade accuracy for security, limiting their practicality in high-stakes domains like finance where both precision and reliability are paramount.
Meanwhile, reinforcement learning (RL) has shown considerable potential for adaptive cybersecurity, offering dynamic decision-making in the face of evolving adversarial behaviors [12]. By framing defense coordination as a sequential decision problem, RL methods can enable financial FL systems to respond adaptively to temporal threats, balancing security enforcement with model utility. This motivates the development of integrated frameworks that combine multi-layered defenses with RL-based coordination to address sophisticated temporal backdoor attacks in financial federated learning environments.

1.2. Motivation and Contributions

Despite growing efforts to enhance the robustness of federated learning, several critical research gaps remain unaddressed in the context of financial systems:
1. Existing security frameworks do not adequately account for temporal backdoor attacks that exploit sequential dependencies in financial transaction streams. Most defenses assume static or isolated threat models, overlooking coordinated multi-round adversarial strategies.
2. Current defense mechanisms are largely single-layered and static, focusing either on aggregation robustness or anomaly detection, without integrating multiple complementary layers that can collectively enhance resilience against adaptive attackers.
3. Few approaches incorporate dynamic coordination mechanisms to balance security and utility. Static thresholds or fixed strategies cannot adapt to changing adversarial intensity, heterogeneous client behaviors, and evolving network conditions.
To address these gaps, this work introduces DEFEND, a comprehensive defense framework for federated learning in financial systems. Our framework integrates temporal behavior analysis, robust statistical aggregation, and multi-scale verification into a multi-layered architecture, while employing a Markov Decision Process (MDP) formulation and reinforcement learning for adaptive coordination. The contributions of this paper are summarized as follows:
  • We formalize temporal backdoor threats in federated financial learning by characterizing attack strategies that exploit multi-period dependencies across training rounds.
  • We design a multi-layer defense architecture that combines temporal behavior profiling, robust aggregation, and multi-scale verification to jointly enhance detection accuracy and resilience.
  • We formulate defense coordination as an MDP problem and develop an RL-based policy using Proximal Policy Optimization (PPO) to dynamically manage defense actions, balancing robustness and model performance.
  • We validate DEFEND on multiple benchmark datasets (CIFAR-10, FEMNIST, MNIST) under varying degrees of heterogeneity and adversarial participation, demonstrating superior defense success rates and detection efficiency compared to state-of-the-art baselines.
The remainder of this paper is organized as follows. Section 2 reviews existing research on federated learning security, temporal attack detection, and multi-layer defense coordination. Section 3 describes the proposed DEFEND framework. Section 4 presents experimental evaluations. Section 6 concludes and outlines directions for future work.

2. Related Work

2.1. Federated Learning Security in Financial Systems

Federated learning security in financial systems has emerged as a critical research area addressing the inherent privacy and security challenges in distributed financial data processing environments [13,14,15,16,17,18,19,20,21]. Chen et al. [13] conducted an extensive survey identifying the intricate security challenges within federated learning frameworks, emphasizing vulnerabilities in communication links and potential cyber threats across decentralized networks. Their comprehensive analysis delves into various defensive strategies and explores applications across different sectors, contributing to the development of secure and efficient federated learning systems. To address specific financial fraud detection challenges, Aljunaid et al. [14] proposed an Explainable Federated Learning (XFL) model that integrates Shapley Additive Explanations (SHAP) and LIME techniques for enhanced interpretability while maintaining privacy compliance. Their approach achieved 99.95% accuracy with a miss rate of 0.05%, effectively eliminating false positives in financial fraud classification while preserving data privacy and regulatory compliance.
Building on decentralized security considerations, Hallaji et al. [15] performed a thorough security analysis of decentralized federated learning systems, studying possible variations of threats and adversaries and surveying potential defense mechanisms. Their work examines how blockchain technologies can eliminate server-related threats, while acknowledging the new privacy challenges introduced by decentralized architectures. To enhance privacy preservation in financial technology applications, Xiong et al. [20] developed a Heterogeneous Privacy-Preserving Blockchain-Enabled Federated Learning (HPP-BEFL) system designed for social fintech environments. Their PKI and identity-based heterogeneous authenticated asymmetric group key agreement (PKI-IB-HAAGKA) protocol mitigates man-in-the-middle and inference attacks while addressing cryptosystem heterogeneity issues.
Recent advances have also explored credit risk assessment architectures [16], behavioral anomaly detection in dynamic transaction graphs [17], and blockchain-based knowledge enhancement mechanisms [22]. However, existing federated learning security frameworks lack comprehensive temporal attack detection mechanisms and fail to adequately address sophisticated backdoor attacks that exploit temporal patterns in financial data streams, which are essential for defending against coordinated multi-period adversarial strategies in distributed financial learning environments.
Table 1. Comparison of our work with related studies.
Ref [13] [23] [15] [20] [24] [25] [26] [27] [14] [28] [29] Proposed work
Feature
Financial application domain
Federated learning framework
Temporal attack detection
Multi-layer defense design
MDP/RL coordination

2.2. Temporal Attack Detection and Defense Mechanisms

Temporal attack detection and defense mechanisms have gained significant attention as adversaries increasingly exploit time-dependent vulnerabilities in distributed systems [24,25,30,31,32,33,34,35]. Zamanzadeh et al. [25] provided a comprehensive survey of deep learning approaches for time series anomaly detection, highlighting the importance of identifying anomalous patterns that indicate novel or unexpected events such as production faults and system defects. Their taxonomy encompasses anomaly detection strategies and deep learning models, highlighting the challenges presented by the large size and complexity of temporal patterns in time series data. Duan et al. [24] addressed practical cyber attack detection through Continuous Temporal Graph (CTG) neural networks in dynamic network systems, proposing an interaction-centered perspective that refines information interactions between network entities into CTG evolution processes. Their framework naturally incorporates new node access behaviors and presents a message aggregation scheme that fuses spatio-temporal neighborhoods with actual time distribution and historical states, demonstrating superior performance on ToN-IoT, UNSWNB15, CIC-Dark2020, and J.P. Morgan payment datasets.
To address zero-day attack challenges, Wu et al. [31] developed an active learning framework using Deep Q-Network (DQN) for intelligent sample selection with probability distribution analysis. Their approach integrates Bi-directional Long Short-Term Memory (BiLSTM) networks into the DQN model to analyze temporal correlations within static classification contexts, employing Euclidean distance functions for accurate sample labeling. Hammad et al. [33] explored deep reinforcement learning for adaptive cyber defense, implementing cutting-edge DRL techniques including Deep Q-Networks (DQN), Proximal Policy Optimization (PPO), and Twin Delayed Deep Deterministic Policy Gradient (TD3) for real-time threat discernment and neutralization across varied cyber threat scenarios ranging from malware invasions to phishing attacks and adversarial assaults.
Recent developments have also investigated spatio-temporal advanced persistent threat detection in cyber-physical power systems [30], advances in time-series anomaly detection algorithms and benchmarks [36], and security defense strategies for Internet of Things based on deep reinforcement learning [37]. However, existing temporal attack detection frameworks lack sophisticated multi-period pattern recognition capabilities and fail to address coordinated backdoor injection strategies that exploit temporal dependencies across distributed learning rounds, which are crucial for detecting and defending against sophisticated temporal poisoning attacks in federated financial learning environments.

2.3. Multi-Layer Defense and MDP-based Coordination

Multi-layer defense mechanisms and MDP-based coordination strategies have emerged as critical components for robust federated learning systems, addressing the need for comprehensive protection against sophisticated adversarial attacks [23,26,27,28,29,38,39,40,41,42]. Li et al. [28] proposed a Multi-layer Aggregation Backdoor Defense Framework (MABDF) that ensures secure model aggregation through adaptive similarity filtering, pruning mean aggregation, and subspace robust projection methods. Their three-layer architecture calculates pairwise cosine similarity between client updates with dynamic thresholds based on median and standard deviation, applies pruning mean aggregation to detect hidden gradient operations, and projects updates onto low-rank subspaces through singular value decomposition (SVD) to suppress backdoor neuron activation, achieving backdoor attack success rates below 3% while maintaining model accuracy decrease of no more than 1.5%. Uddin et al. [23] conducted a systematic literature review using the PRISMA framework, analyzing 244 studies across eight themes of robust federated learning including objective regularization, optimizer modification, differential privacy, client selection, and new aggregation algorithms, providing comprehensive insights into approaches for enhancing FL model robustness against adversarial attacks and noisy updates.
For dynamic policy coordination, Bello et al. [26] developed a security strategy incorporating zero-trust models with dynamic policy decisions through stochastic games and reinforcement learning techniques. Their approach employs Generalized Proximal Policy Optimization with sample reuse (GePPO) and its meta-learning variant GePPO-ML, along with Sample Dropout PPO with meta-learning (SDPPO-ML) for adaptive policy updates, demonstrating superior performance compared to baseline REINFORCE and PPO algorithms in next generation network security scenarios. Huang et al. [27] addressed Byzantine robustness in heterogeneous federated learning through Self-Driven Entropy Aggregation (SDEA), leveraging random public datasets to conduct robust aggregation by introducing learnable aggregation weights that minimize instance-prediction entropy while maximizing batch-prediction entropy to accommodate diverse client tendencies and detect Byzantine attackers effectively.
Recent advances have also explored multi-layered protection systems for cloud computing environments [39], safe reinforcement learning frameworks for risk-averse dispatch with frequency security constraints [40], and security metrics for assessing power grids against attacks from EV charging ecosystems using Markov decision processes [29,43]. However, existing multi-layer defense frameworks lack integrated temporal anomaly detection capabilities and fail to provide coordinated MDP-based defense mechanisms that can dynamically adapt to evolving temporal backdoor attack patterns while maintaining optimal resource allocation and system performance in distributed financial learning environments.

3. Method

This section presents our framework for detecting and defending against temporal backdoor attacks in federated learning environments. The methodology comprises three temporal poisoning attack models, a multi-objective optimization formulation of the detection problem, statistical analysis techniques for the defense layers, and an MDP-based defense coordination mechanism. The method architecture is shown in Figure 1.

3.1. Problem Formulation

Consider a federated learning system comprising $N$ participating clients denoted as $\mathcal{C} = \{c_1, c_2, \ldots, c_N\}$, where each client $c_i$ maintains a local dataset $\mathcal{D}_i = \{(x_{i,j}, y_{i,j})\}_{j=1}^{n_i}$ over a finite communication horizon $T$. The temporal data sequence for client $c_i$ is represented as $X_i = \{x_i^1, x_i^2, \ldots, x_i^T\}$, where $x_i^t \in \mathbb{R}^m$ denotes the $m$-dimensional feature vector observed at communication round $t$. The corresponding ground truth labels are denoted as $y_i = \{y_i^1, y_i^2, \ldots, y_i^T\}$, where $y_i^t \in \mathcal{Y}$ represents the true class label at round $t$ from the label space $\mathcal{Y}$.
The client operational states are characterized by a discrete state space $\mathcal{S} = \{s_n, s_s, s_p\}$, where $s_n$ represents the normal operational state, $s_s$ indicates a suspicious state requiring further monitoring, and $s_p$ denotes a confirmed poisoned state. During the poisoning process, a subset of clients $\mathcal{C}_p \subseteq \mathcal{C}$ with cardinality $|\mathcal{C}_p| \le \kappa N$, where $\kappa \in (0, 1)$ represents the maximum poisoning ratio, becomes compromised during specific time intervals $\mathcal{T}_p \subseteq \{1, 2, \ldots, T\}$.
Each client $c_i$ is characterized by a resource profile $R_i = (C_i^{\max}, M_i^{\max}, E_i^{\max}, B_i^{\max})$, where $C_i^{\max}$ represents maximum computational capacity, $M_i^{\max}$ denotes memory capacity, $E_i^{\max}$ indicates energy budget, and $B_i^{\max}$ specifies communication bandwidth. The available resources at round $t$ are denoted as $R_i^t = (C_i^t, M_i^t, E_i^t, B_i^t)$, where each component satisfies $0 \le R_i^t \le R_i$.
The fundamental objective of our detection framework is to minimize the overall detection error, formulated as a multi-objective optimization problem with temporal consistency constraints:
$$\mathcal{L}_\tau = \sum_{i=1}^{N} \sum_{t=1}^{T} l_d(\hat{s}_i^t, s_i^t) + \lambda_1 \sum_{i \in \mathcal{C}_p} l_t(\hat{\mathcal{T}}_{p,i}, \mathcal{T}_{p,i}) + \lambda_2 R_{f,p} + \lambda_3 \int_0^T C(u)\, du + \lambda_4 H(s^t),$$
where $l_d : \mathcal{S} \times \mathcal{S} \to \mathbb{R}_{\ge 0}$ represents the client state classification loss function, $\hat{s}_i^t$ denotes the predicted client state, $l_t$ quantifies the temporal localization error between predicted intervals $\hat{\mathcal{T}}_{p,i}$ and true poisoning intervals $\mathcal{T}_{p,i}$, $R_{f,p}$ denotes the false positive penalty term, $C(u)$ is the continuous cost function modeling resource consumption, $H(s^t)$ represents the entropy regularization term for uncertainty quantification, and $\lambda_1, \lambda_2, \lambda_3, \lambda_4 \in \mathbb{R}_{>0}$ are regularization coefficients that balance different objectives.
The temporal localization error is computed using a weighted Hausdorff distance:
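$$l_t(\hat{\mathcal{T}}_{p,i}, \mathcal{T}_{p,i}) = \frac{1}{2} \left[ \max_{t \in \hat{\mathcal{T}}_{p,i}} \min_{\tau \in \mathcal{T}_{p,i}} |t - \tau| + \max_{\tau \in \mathcal{T}_{p,i}} \min_{t \in \hat{\mathcal{T}}_{p,i}} |t - \tau| \right],$$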
which ensures both precision and recall in temporal poisoning interval detection.
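As a minimal sketch of the localization error above, the following Python function averages the two directed Hausdorff distances between a predicted and a true set of poisoned rounds. The function name and the example round ranges are illustrative assumptions, not part of the original framework.

```python
def temporal_localization_error(predicted, actual):
    """Average of the two directed Hausdorff distances between round sets."""
    predicted, actual = set(predicted), set(actual)
    if not predicted or not actual:
        raise ValueError("both interval sets must be non-empty")
    # Worst-case distance from a predicted round to its nearest true round.
    forward = max(min(abs(t - tau) for tau in actual) for t in predicted)
    # Worst-case distance from a true round to its nearest predicted round.
    backward = max(min(abs(t - tau) for t in predicted) for tau in actual)
    return 0.5 * (forward + backward)


# Hypothetical example: the detector flags rounds 10..15,
# while the attack actually spans rounds 12..20.
predicted_rounds = range(10, 16)
true_rounds = range(12, 21)
error = temporal_localization_error(predicted_rounds, true_rounds)
```

The forward term penalizes rounds flagged outside the true interval (a precision-like error), and the backward term penalizes true attack rounds that were missed (a recall-like error).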

3.2. Temporal Backdoor Attack Models

To comprehensively evaluate our detection framework, we design three sophisticated temporal backdoor strategies that exploit different characteristics of federated learning protocols. Each attack model represents a distinct threat vector commonly encountered in real-world federated environments, incorporating realistic constraints on attack capabilities and detectability thresholds.

3.2.1. Fixed-Period Data Poisoning Attack

The fixed-period data poisoning attack systematically injects backdoor triggers into client datasets during predetermined temporal windows, eventually culminating in significant global model degradation at coordinated trigger points. This attack strategy maintains stealth by concentrating the poisoning effect within specific time intervals while preserving normal behavior patterns outside attack periods.
For malicious client $c_i \in \mathcal{C}_p$ at communication round $t$, the poisoned local dataset $\tilde{\mathcal{D}}_i^t$ is constructed according to:
$$\tilde{\mathcal{D}}_i^t = \begin{cases} \mathcal{D}_i \cup \mathcal{P}_i^t & \text{if } t \in \mathcal{T}_\phi, \\ \mathcal{D}_i & \text{otherwise}, \end{cases}$$
where $\mathcal{T}_\phi = [t_s, t_e] \subseteq \{1, 2, \ldots, T\}$ denotes the fixed attack interval, and the poisoned sample set $\mathcal{P}_i^t$ follows a generation process incorporating temporal dynamics and spatial correlations.
The poisoned sample construction employs multi-dimensional trigger injection with harmonic modulation:
$$\mathcal{P}_i^t = \left\{ \left( x_j + \epsilon_i^t \odot a_i^t + \sum_{k=1}^{K} \alpha_k \sin\!\left( \frac{2 \pi k (t - t_s)}{|\mathcal{T}_\phi|} + \phi_k \right) w_k,\; y_\tau \right) : (x_j, y_j) \in \mathcal{S}_i^t \right\},$$
where $\odot$ denotes element-wise multiplication, $a_i^t \in \mathbb{R}^m$ represents the primary attack direction vector, $K$ is the number of harmonic components, $\alpha_k$ and $\phi_k$ control amplitude and phase parameters, $w_k \in \mathbb{R}^m$ are harmonic weight vectors, $y_\tau$ denotes the target label, and $\mathcal{S}_i^t \subseteq \mathcal{D}_i$ represents the selected subset for poisoning.
The time-dependent perturbation magnitude ϵ i t follows an accumulative schedule incorporating memory effects and stochastic variations:
$$\epsilon_i^t = \alpha_i \sum_{\tau = t_s}^{t} \beta^{t - \tau} \gamma_\tau \cdot \mathbb{I}[\tau \in \mathcal{T}_\phi] \cdot \exp\!\left( -\frac{(\tau - \mu_i)^2}{2 \sigma_i^2} \right) \cdot \left( 1 + \delta_i \sum_{j \in \mathcal{N}_i} \zeta_{i,j}^\tau \right),$$
where $\alpha_i > 0$ represents the base accumulation rate, $\beta \in (0, 1)$ is the temporal decay factor, $\gamma_\tau$ are stochastic scaling factors following a Markov chain with transition probabilities $P_{k,l} = \exp(\theta_{k,l}) / \sum_j \exp(\theta_{k,j})$, the Gaussian envelope ensures smooth temporal transitions with mean $\mu_i$ and variance $\sigma_i^2$, $\delta_i$ controls neighborhood influence, $\mathcal{N}_i$ represents neighboring clients, and $\zeta_{i,j}^\tau$ captures inter-client correlation effects.

3.2.2. Multi-Period Data Poisoning Attack

The multi-period data poisoning attack exploits temporal dependencies by distributing backdoor injection across multiple non-consecutive communication rounds, creating sophisticated interference patterns that evade single-period detection mechanisms while maintaining cumulative backdoor effectiveness.
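The poisoned dataset construction employs period-specific trigger variations across disjoint temporal intervals $\mathcal{T}_\mu = \{\mathcal{T}_1, \mathcal{T}_2, \ldots, \mathcal{T}_K\}$ where $\mathcal{T}_j \cap \mathcal{T}_k = \emptyset$ for $j \ne k$:
$$\tilde{\mathcal{D}}_i^t = \begin{cases} \mathcal{D}_i \cup \mathcal{P}_{i,j}^t & \text{if } t \in \mathcal{T}_j,\; j \in \{1, 2, \ldots, K\}, \\ \mathcal{D}_i & \text{otherwise}, \end{cases}$$
where period-specific poisoned samples $\mathcal{P}_{i,j}^t$ incorporate adaptive trigger patterns and cross-period consistency constraints.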
The multi-dimensional periodic signal injection follows:
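$$\mathcal{P}_{i,j}^t = \left\{ \left( x_l + \gamma_i S_{i,j}^t + \delta_i Q_{i,j}^t,\; y_\tau \right) : (x_l, y_l) \in \mathcal{S}_{i,j}^t \right\},$$
where $\gamma_i, \delta_i \in \mathbb{R}_{>0}$ control injection intensities for the primary and secondary periodic signals $S_{i,j}^t$ and $Q_{i,j}^t$, respectively, and $\mathcal{S}_{i,j}^t$ is the subset of samples selected for poisoning.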
The primary multi-dimensional periodic signal S i , j t is defined as:
$$S_{i,j}^t = \sum_{k=1}^{K_j} A_{i,j,k} \sin\!\left( \frac{2 \pi t}{P_{i,j,k}} + \phi_{i,j,k} \right) \odot W_{i,j,k}^t + \sum_{l=1}^{L_j} B_{i,j,l} \cos\!\left( \frac{2 \pi t}{Q_{i,j,l}} + \psi_{i,j,l} \right) \odot V_{i,j,l}^t,$$
where $K_j$ and $L_j$ represent the numbers of sine and cosine components for period $j$, $A_{i,j,k}, B_{i,j,l} \in \mathbb{R}^m$ denote amplitude vectors, $P_{i,j,k}$ and $Q_{i,j,l}$ are fundamental periods, $\phi_{i,j,k}$ and $\psi_{i,j,l}$ are phase shifts, and $W_{i,j,k}^t$ and $V_{i,j,l}^t$ are time-varying weight vectors.

3.2.3. Model Weight Poisoning Attack

The model weight poisoning attack creates direct manipulation of local model parameters before transmission to the server, generating high-intensity anomalous parameter patterns that bypass data-level detection mechanisms while maintaining coordinated execution across multiple malicious clients.
The poisoned model update follows a multi-state regime-switching framework with sophisticated perturbation strategies:
$$\tilde{\theta}_i^t = \begin{cases} \theta_i^t + \delta_i^t n_i^t + \eta_i^t m_i^t & \text{if } c_i \in \mathcal{C}_p \text{ and } M_i^t = 1, \\ \theta_i^t + \tfrac{1}{2} \delta_i^t n_i^t & \text{if } c_i \in \mathcal{C}_p \text{ and } M_i^t = 2, \\ \theta_i^t + \varepsilon_i^t \chi_i^t & \text{if } c_i \in \mathcal{C}_p \text{ and } M_i^t = 3, \\ \theta_i^t & \text{otherwise}, \end{cases}$$
where $M_i^t \in \{1, 2, 3, 4\}$ is a Markov chain state indicator controlling attack intensity, $\delta_i^t$ controls primary poisoning magnitude, $n_i^t$ is the primary directional perturbation vector, $\eta_i^t$ controls secondary poisoning magnitude, $m_i^t$ is the secondary perturbation vector, $\varepsilon_i^t$ controls background noise magnitude, and $\chi_i^t$ represents background perturbations.
The poisoning intensity follows a multi-phase decay model coordinated across malicious clients:
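$$\delta_i^t = \delta_{m,i}\, f_i(t - t_{\ast,i}) \cdot \mathbb{I}[c_i \in \mathcal{C}_p],$$
$$f_i(u) = \begin{cases} \exp\!\left( -\dfrac{u}{\tau_{d,1,i}} \right) & \text{if } 0 \le u \le T_{1,i}, \\ \exp\!\left( -\dfrac{u - T_{1,i}}{\tau_{d,2,i}} \right) & \text{if } T_{1,i} < u \le T_{2,i}, \\ \exp\!\left( -\dfrac{u - T_{2,i}}{\tau_{d,3,i}} \right) & \text{if } u > T_{2,i}, \end{cases}$$
where $\delta_{m,i}$ is the maximum intensity for client $c_i$, $t_{\ast,i}$ is the attack initiation time, $\tau_{d,j,i}$ are phase-specific decay time constants, and $T_{j,i}$ are phase transition times.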
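For use in an attack simulator during evaluation, the multi-phase decay profile can be sketched as a small Python function. This is an illustrative reading of the piecewise definition above, assuming each exponent is negative (a decay) and that the intensity is zero before the attack starts; the function and parameter names are hypothetical.

```python
import math


def phase_decay(u, t1, t2, tau1, tau2, tau3):
    """Three-phase exponential decay f_i(u), with u = rounds since attack start.

    t1, t2 are the phase transition times; tau1..tau3 are the decay constants.
    """
    if u < 0:
        return 0.0  # assumption: no poisoning before the initiation time
    if u <= t1:
        return math.exp(-u / tau1)
    if u <= t2:
        return math.exp(-(u - t1) / tau2)
    return math.exp(-(u - t2) / tau3)


# Hypothetical schedule: transitions at u = 5 and u = 10.
profile = [phase_decay(u, 5, 10, 2.0, 3.0, 4.0) for u in range(15)]
```

Under this reading the profile returns to a value near 1 immediately after each transition time, so the intensity is a sequence of decaying bursts and not a single monotone decay.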

3.3. Multi-Layer Defense Framework

Our defense strategy employs a sophisticated three-tier detection architecture that combines temporal behavioral analysis, robust statistical aggregation, and multi-scale validation protocols to counter the identified attack vectors while maintaining theoretical guarantees and computational efficiency.

3.3.1. Temporal Behavioral Analysis Layer

The temporal analysis layer monitors client behavioral patterns across communication rounds through comprehensive statistical profiling and anomaly detection mechanisms incorporating both individual client dynamics and cross-client correlation analysis.
For each client c i , we construct a multi-dimensional behavioral profile vector encompassing temporal features, historical patterns, connectivity measures, and uncertainty quantification:
$$b_i^t = [F_i^t, H_i^t, C_i^t, U_i^t, R_i^t],$$
where $F_i^t$ represents temporal features, $H_i^t$ encodes historical patterns, $C_i^t$ models connectivity, $U_i^t$ quantifies uncertainty, and $R_i^t$ captures resource utilization patterns.
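The temporal feature vector $F_i^t \in \mathbb{R}^{d_f}$ contains multi-scale statistical features extracted from recent communication windows:
$$F_i^t = [\mu_i^t, \sigma_i^t, \varsigma_i^t, \kappa_i^t, F_{\omega,i}^t, F_{\psi,i}^t, F_{\xi,i}^t],$$
where the statistical moments are computed using robust estimators across multiple time scales:
$$\mu_i^t = \frac{1}{w_f} \sum_{\tau = t - w_f + 1}^{t} \|\theta_i^\tau\|_F + \frac{1}{w_s} \sum_{\tau = t - w_s + 1}^{t - w_f} \|\theta_i^\tau\|_F,$$
$$\sigma_i^t = \sqrt{ \frac{1}{w_f - 1} \sum_{\tau = t - w_f + 1}^{t} \left( \|\theta_i^\tau\|_F - \mu_i^t \right)^2 + \epsilon_\sigma },$$
$$\varsigma_i^t = \frac{1}{w_f} \sum_{\tau = t - w_f + 1}^{t} \left( \frac{\|\theta_i^\tau\|_F - \mu_i^t}{\sigma_i^t} \right)^3 + \rho_\varsigma\, \varsigma_i^{t-1},$$
$$\kappa_i^t = \frac{1}{w_f} \sum_{\tau = t - w_f + 1}^{t} \left( \frac{\|\theta_i^\tau\|_F - \mu_i^t}{\sigma_i^t} \right)^4 - 3 + \rho_\kappa\, \kappa_i^{t-1},$$
where $w_f$ and $w_s$ are short and long window sizes, $\epsilon_\sigma$ prevents numerical instability, $\rho_\varsigma, \rho_\kappa \in (0, 1)$ provide temporal smoothing, $F_{\omega,i}^t \in \mathbb{R}^{d_\omega}$ contains dominant frequency components from spectral analysis, $F_{\psi,i}^t \in \mathbb{R}^{d_\psi}$ includes wavelet coefficients, and $F_{\xi,i}^t \in \mathbb{R}^{d_\xi}$ represents spectral entropy measures.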
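The four moment features can be sketched with NumPy as below. This is one possible reading of the definitions above, assuming the stabilizer sits inside the square root and that the input is a per-client history of Frobenius norms of model updates; the function name and default values are illustrative, and the spectral and wavelet features are omitted.

```python
import numpy as np


def norm_moments(norms, w_f, w_s, eps=1e-8,
                 rho_skew=0.0, rho_kurt=0.0, prev_skew=0.0, prev_kurt=0.0):
    """Mean, spread, skewness and excess kurtosis of recent update norms.

    norms: history of ||theta_i^tau||_F values, newest last.
    w_f, w_s: short and long window sizes (w_f < w_s).
    """
    norms = np.asarray(norms, dtype=float)
    if not (1 < w_f < w_s <= len(norms)):
        raise ValueError("need 1 < w_f < w_s <= len(norms)")
    recent = norms[-w_f:]        # rounds t - w_f + 1 .. t
    older = norms[-w_s:-w_f]     # rounds t - w_s + 1 .. t - w_f
    mu = recent.sum() / w_f + older.sum() / w_s
    sigma = np.sqrt(((recent - mu) ** 2).sum() / (w_f - 1) + eps)
    z = (recent - mu) / sigma
    # The rho terms smooth the current estimate with the previous round's value.
    skew = np.mean(z ** 3) + rho_skew * prev_skew
    kurt = np.mean(z ** 4) - 3.0 + rho_kurt * prev_kurt
    return float(mu), float(sigma), float(skew), float(kurt)


# Hypothetical history: a client whose update norm is constant at 2.0.
history = [2.0] * 8
mu, sigma, skew, kurt = norm_moments(history, w_f=4, w_s=8)
```

Note that, as written, the location term adds the short-window average to a long-window term normalized by the full long window size, so for a constant norm of 2.0 it evaluates to 3.0, not 2.0.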
The anomaly detection mechanism employs multi-scale sliding window analysis with adaptive thresholding:
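$$A_i^t = \bigvee_{w \in \mathcal{W}} \left[ \frac{1}{w} \sum_{\tau = t - w + 1}^{t} \left\| b_i^\tau - \mu_i^{t-w} \right\|_{\Sigma_i^{t-w}} > \tau_\alpha^w \right],$$
where $\mathcal{W} = \{w_1, w_2, \ldots, w_L\}$ contains multiple window sizes, $\mu_i^{t-w}$ and $\Sigma_i^{t-w}$ represent the historical mean and covariance computed through robust estimation, $\|\cdot\|_\Sigma$ denotes the Mahalanobis distance, and $\tau_\alpha^w$ are scale-specific detection thresholds.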
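A minimal sketch of the multi-scale test is given below. It assumes a client is flagged when any window exceeds its threshold, and it substitutes the plain sample mean and covariance for the robust estimators mentioned in the text; the function names, the ridge term, and the thresholds are illustrative assumptions.

```python
import numpy as np


def mahalanobis(x, mean, cov, ridge=1e-6):
    """Mahalanobis distance of x from mean, with a small ridge for stability."""
    diff = x - mean
    cov = np.atleast_2d(cov) + ridge * np.eye(len(mean))
    return float(np.sqrt(diff @ np.linalg.solve(cov, diff)))


def is_anomalous(profiles, windows, thresholds):
    """Flag a client if any window's mean Mahalanobis distance is too large.

    profiles: (rounds, d) array of behavioral vectors b_i, newest last.
    windows, thresholds: matching sequences of window sizes and cut-offs.
    """
    profiles = np.asarray(profiles, dtype=float)
    for w, threshold in zip(windows, thresholds):
        history, recent = profiles[:-w], profiles[-w:]
        if len(history) < 2:
            continue  # not enough history to estimate a covariance
        mean = history.mean(axis=0)
        cov = np.cov(history, rowvar=False)
        score = np.mean([mahalanobis(b, mean, cov) for b in recent])
        if score > threshold:
            return True
    return False


# Hypothetical data: 60 rounds of 3-dimensional profiles, then a late shift.
rng = np.random.default_rng(0)
calm = rng.normal(size=(60, 3))
drifted = calm.copy()
drifted[-5:] += 10.0
```

With these inputs, `is_anomalous(calm, [5, 10], [5.0, 5.0])` is expected to return `False` and the same call on `drifted` to return `True`. A production detector would likely replace the sample estimates with robust ones such as a minimum covariance determinant estimator.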
For multi-period attack detection, we implement a sophisticated pattern matching algorithm based on dynamic time warping:
$$P_i^t = \max_{P \in \mathcal{P}_\Theta} \mathrm{DTW}\!\left( \{A_i^\tau\}_{\tau = t - T_P + 1}^{t},\; P \right),$$
where $\mathcal{P}_\Theta$ contains known attack pattern templates and DTW computes optimal alignment scores accounting for temporal distortions.
The DTW distance is computed through dynamic programming with warping constraints:
$$\mathrm{DTW}(X, Y) = D[n_x, n_y],$$
$$D[i, j] = d(x_i, y_j) + \min\{ D[i-1, j],\; D[i, j-1],\; D[i-1, j-1] \},$$
$$\text{s.t. } |i - j| \le W_{\text{warp}},$$
where $d(x_i, y_j)$ is the local distance function, $n_x$ and $n_y$ are sequence lengths, and $W_{\text{warp}}$ constrains warping flexibility.
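The recursion above can be sketched in plain Python as follows. The absolute difference is assumed as the local distance $d$, and the band is widened to at least the length difference so that the final cell is always reachable; both choices, and the function name, are assumptions made for illustration.

```python
import math


def dtw_distance(x, y, warp_window):
    """DTW distance between 1-D sequences with a band constraint |i - j| <= W."""
    n, m = len(x), len(y)
    band = max(warp_window, abs(n - m))  # keep the end cell inside the band
    # D[i][j] is the best cumulative cost aligning x[:i] with y[:j].
    D = [[math.inf] * (m + 1) for _ in range(n + 1)]
    D[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(max(1, i - band), min(m, i + band) + 1):
            cost = abs(x[i - 1] - y[j - 1])
            D[i][j] = cost + min(D[i - 1][j], D[i][j - 1], D[i - 1][j - 1])
    return D[n][m]


# A stretched copy of the template aligns at zero cost.
template = [0, 1, 2]
stretched = [0, 0, 1, 2]
distance = dtw_distance(stretched, template, warp_window=1)
```

Because this function returns a distance, a closer match to a template gives a smaller value. If it is used for template matching with a maximum over templates, it would first need to be converted to a similarity score, for example by negation.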

3.3.2. Robust Statistical Aggregation Layer

The aggregation layer employs Byzantine-robust techniques enhanced with temporal consistency constraints and multi-dimensional filtering to identify and mitigate malicious model updates while preserving the convergence properties of the global optimization process. The robust aggregation mechanism operates through sequential filtering stages incorporating geometric median estimation, distance-based outlier detection, and weighted consensus formation. The robust central tendency is established through iterative geometric median estimation:
$$\hat{\theta}_\nu^t = \arg\min_{\theta} \sum_{i=1}^{N} \left\| \theta - \theta_i^t \right\|_F,$$
solved using the accelerated Weiszfeld algorithm with momentum-based convergence enhancement:
$$\theta^{(k+1)} = \frac{ \sum_{i=1}^{N} \dfrac{\theta_i^t}{\| \theta^{(k)} - \theta_i^t \|_F + \epsilon} }{ \sum_{i=1}^{N} \dfrac{1}{\| \theta^{(k)} - \theta_i^t \|_F + \epsilon} },$$
$$\theta^{(k+1)} \leftarrow \theta^{(k+1)} + \alpha_k \left( \theta^{(k+1)} - \theta^{(k)} \right),$$
where $\epsilon$ prevents numerical instabilities and $\alpha_k$ provides adaptive momentum acceleration. Suspicious clients are identified through distance-based analysis incorporating multiple statistical measures:
$$\mathcal{S}^t = \left\{ c_i : \left\| \theta_i^t - \hat{\theta}_\nu^t \right\|_F > Q_{1-\alpha}\!\left( \{d_j^t\}_{j=1}^{N} \right) + \beta \cdot \mathrm{IQR}\!\left( \{d_j^t\}_{j=1}^{N} \right) \;\lor\; A_i^t = \text{True} \right\},$$
where $d_j^t = \| \theta_j^t - \hat{\theta}_\nu^t \|_F$, $Q_{1-\alpha}$ denotes the $(1-\alpha)$-quantile, IQR represents the interquartile range, $\beta > 0$ controls filtering sensitivity, and $A_i^t$ incorporates temporal anomaly detection results. The final global model incorporates temporal consistency constraints and reputation-based weighting:
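$$\theta^t = \arg\min_{\theta} \sum_{i \notin \mathcal{S}^t} w_i^t \left\| \theta - \theta_i^t \right\|_F^2 + \lambda_\tau \left\| \theta - \theta^{t-1} \right\|_F^2 + \lambda_\nu \left\| \theta - \hat{\theta}_\nu^t \right\|_F^2,$$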
where the client weights incorporate multi-factor scoring:
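$$w_i^t = \frac{ \exp\!\left( -\gamma \cdot \left( A_i^t + \mathbb{I}[c_i \in \mathcal{S}^t] + R_i^t \right) \right) }{ \sum_{j \notin \mathcal{S}^t} \exp\!\left( -\gamma \cdot \left( A_j^t + \mathbb{I}[c_j \in \mathcal{S}^t] + R_j^t \right) \right) },$$
and the reputation score $R_i^t$ captures historical behavior patterns through multi-scale temporal weighting:
$$R_i^t = \sum_{k=1}^{K_r} \omega_k \sum_{\tau = \max(1,\, t - w_k)}^{t-1} \exp\!\left( -\frac{t - \tau}{\sigma_k} \right) \mathbb{I}[c_i \text{ flagged at round } \tau],$$
where $K_r$ is the number of temporal scales, $\omega_k$ are scale-specific weights, $w_k$ are window sizes, and $\sigma_k$ control exponential decay rates.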
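The aggregation stages of this layer can be sketched end to end with NumPy. The sketch below uses the plain Weiszfeld iteration (the momentum step is omitted), the quantile-plus-IQR distance filter without the temporal anomaly term, and the closed-form minimizer of the regularized weighted least-squares objective, which is the weighted average $(\sum_i w_i \theta_i + \lambda_\tau \theta^{t-1} + \lambda_\nu \hat{\theta}_\nu) / (\sum_i w_i + \lambda_\tau + \lambda_\nu)$. Function names, default parameters, uniform weights, and the toy data are all assumptions made for illustration.

```python
import numpy as np


def geometric_median(updates, iters=100, eps=1e-8, tol=1e-9):
    """Plain Weiszfeld iteration over rows of `updates` (no momentum)."""
    median = updates.mean(axis=0)
    for _ in range(iters):
        dist = np.linalg.norm(updates - median, axis=1) + eps
        new = (updates / dist[:, None]).sum(axis=0) / (1.0 / dist).sum()
        if np.linalg.norm(new - median) < tol:
            return new
        median = new
    return median


def flag_suspicious(updates, median, alpha=0.1, beta=1.5):
    """Flag updates whose distance exceeds the (1 - alpha)-quantile + beta * IQR."""
    d = np.linalg.norm(updates - median, axis=1)
    q1, q3 = np.percentile(d, [25, 75])
    return d > np.quantile(d, 1 - alpha) + beta * (q3 - q1)


def aggregate(updates, weights, prev_global, median, lam_t, lam_m):
    """Closed-form minimizer of the regularized weighted least-squares objective."""
    numerator = ((weights[:, None] * updates).sum(axis=0)
                 + lam_t * prev_global + lam_m * median)
    return numerator / (weights.sum() + lam_t + lam_m)


# Toy round: nine honest 2-D updates near (1, 1) and one far outlier.
honest = np.array([[1 + 0.01 * k, 1 - 0.01 * k] for k in range(9)])
updates = np.vstack([honest, [[50.0, 50.0]]])
median = geometric_median(updates)
flags = flag_suspicious(updates, median)
kept = updates[~flags]
weights = np.full(len(kept), 1.0 / len(kept))  # uniform weights for simplicity
global_model = aggregate(kept, weights, np.zeros(2), median, lam_t=0.0, lam_m=0.0)
```

In this toy round only the outlier is flagged, and with both regularization coefficients set to zero the aggregate reduces to the mean of the retained updates. Nonzero coefficients pull the result toward the previous global model and the geometric median, respectively.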

3.3.3. Multi-Scale Validation Layer

The validation layer provides continuous monitoring of global model integrity through clean performance tracking, backdoor trigger detection, and coordinated response protocols incorporating statistical testing and model rollback mechanisms. We maintain a held-out validation set $\mathcal{D}_\upsilon$ and track performance degradation through robust statistical testing:
$$L_c^t = \frac{1}{|\mathcal{D}_\upsilon|} \sum_{(x, y) \in \mathcal{D}_\upsilon} l\!\left( f_{\theta^t}(x), y \right),$$
with alert generation through sequential hypothesis testing:
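$$A_c^t = \mathbb{I}\!\left[ L_c^t > L_c^{t-1} + \tau_c \right] \lor \mathbb{I}\!\left[ L_c^t > \mu_\beta + \sigma_\beta \cdot z_{\alpha_c} \right],$$
where $\mu_\beta$ and $\sigma_\beta$ characterize the historical performance distribution and $z_{\alpha_c}$ is the critical value for significance level $\alpha_c$. We maintain a trigger detection dataset $\mathcal{D}_\xi = \{(x_j + \delta_k, y_j)\}$ covering potential trigger patterns and monitor:
$$L_\beta^t = \max_{y_\tau} \frac{1}{|\mathcal{D}_\xi|} \sum_{(x_\xi, y_\pi) \in \mathcal{D}_\xi} \mathbb{I}\!\left[ f_{\theta^t}(x_\xi) = y_\tau \right],$$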
with backdoor detection alert:
$$A_\beta^t = \mathbb{I}\!\left[ L_\beta^t > \tau_\beta \right] \lor \mathbb{I}\!\left[ \frac{L_\beta^t}{L_c^t} > \tau_\rho \right],$$
where $\tau_\beta$ and $\tau_\rho$ are detection thresholds for absolute and relative trigger activation rates. Upon alert generation, the system implements graduated response protocols through state transition mechanisms:
R t = M e if A c t ¬ A β t , I c if A β t P i t > τ π , R θ if A β t L β t > τ κ , N o otherwise ,
where M e denotes enhanced monitoring, I c represents client investigation, R θ implements model rollback, and N o continues normal operation. The model rollback mechanism employs temporal checkpointing with integrity verification:
θ t = θ t k if R t = R θ V ( θ t k ) > τ υ , arg min k K max V ( θ t k ) if R t = R θ V ( θ t k ) τ υ , θ t otherwise ,
where k is determined by the severity of the detected anomaly, V ( · ) quantifies model integrity through comprehensive validation metrics, and K max bounds the maximum rollback distance.
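The graduated response can be read as a small decision function. The sketch below is illustrative only: the enum, the function name, and the argument names are hypothetical, and it assumes the cases are evaluated in the order listed, so that client investigation takes precedence over rollback when both conditions hold. The text does not state how overlapping conditions are resolved.

```python
from enum import Enum

class Response(Enum):
    NORMAL = "N_o"
    ENHANCED_MONITORING = "M_e"
    CLIENT_INVESTIGATION = "I_c"
    MODEL_ROLLBACK = "R_theta"

def graduated_response(clean_alert, backdoor_alert, pattern_score,
                       trigger_rate, tau_pi, tau_kappa):
    """Map the two alerts and two scores to a response (first match wins)."""
    if clean_alert and not backdoor_alert:
        return Response.ENHANCED_MONITORING
    if backdoor_alert and pattern_score > tau_pi:
        return Response.CLIENT_INVESTIGATION
    if backdoor_alert and trigger_rate > tau_kappa:
        return Response.MODEL_ROLLBACK
    return Response.NORMAL

# Hypothetical thresholds: a backdoor alert with a high trigger activation rate.
decision = graduated_response(False, True, pattern_score=0.1,
                              trigger_rate=0.9, tau_pi=0.5, tau_kappa=0.5)
```

Here `decision` is the rollback response, because the backdoor alert fires, the pattern score stays below its threshold, and the trigger rate exceeds its threshold.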

3.4. MDP Framework for Defense Coordination

We formulate the temporal backdoor defense problem as a Markov Decision Process (MDP) $\mathcal{M}_d = (\mathcal{S}_d, \mathcal{A}_d, P_d, R_d, \gamma)$, whose components (state space, action space, transition dynamics, reward function, and discount factor) capture the sequential decision-making nature of coordinated defense.

3.4.1. Defense State Space Design

The defense environment state $s_t^d \in \mathcal{S}_d$ at communication round $t$ summarizes the security status of the federated system through multiple information channels:
$$s_t^d = \{\mathbf{B}_t, \mathbf{D}_t, \mathbf{A}_t, \mathbf{S}_t, \mathbf{U}_t, \mathbf{R}_t, \mathbf{G}_t\},$$
where $\mathbf{B}_t$ represents client behavioral profiles, $\mathbf{D}_t$ encodes detection history, $\mathbf{A}_t$ models anomaly indicators, $\mathbf{S}_t$ maintains security metrics, $\mathbf{U}_t$ quantifies uncertainty measures, $\mathbf{R}_t$ captures resource utilization, and $\mathbf{G}_t$ represents global system statistics.
The client behavioral profile matrix $\mathbf{B}_t \in \mathbb{R}^{N \times d_b}$ contains behavioral features for all clients:
$$\mathbf{B}_t[i,:] = \left[b_i^t,\; \Delta b_i^t,\; \|b_i^t - b_i^{t-1}\|_2,\; \operatorname{rank}(b_i^t),\; B_{\rho,i}^t\right],$$
where $b_i^t$ is the current behavioral profile, $\Delta b_i^t$ represents temporal changes, the norm captures profile stability, $\operatorname{rank}(b_i^t)$ indicates behavioral complexity, and $B_{\rho,i}^t$ encodes cross-client correlations.
The detection history matrix $\mathbf{D}_t \in \mathbb{R}^{N \times d_d}$ maintains weighted information about previous detection decisions with multi-scale temporal decay:
$$\mathbf{D}_t[i,:] = \left[\sum_{k=1}^{K_d} \omega_{d,k} \sum_{\tau=\max(1,\, t-w_{d,k})}^{t-1} \omega_d^{(t-\tau)/k} \cdot \mathbb{I}[\hat{s}_i^\tau = s_p],\; R_i^t,\; C_i^t,\; V_i^t\right],$$
where $K_d$ is the number of temporal scales, $\omega_{d,k}$ are scale-specific weights, $w_{d,k}$ are scale-specific window sizes, $\omega_d \in (0,1)$ provides exponential temporal weighting, $R_i^t$ represents reputation scores, $C_i^t$ captures confidence levels, and $V_i^t$ quantifies detection variance.
The anomaly indicator vector $\mathbf{A}_t \in \mathbb{R}^N$ captures multiple types of anomalies through ensemble-based detection:
$$\mathbf{A}_t[i] = \alpha_a A_{\tau,i}^t + \beta_a A_{\sigma,i}^t + \gamma_a A_{\gamma,i}^t + \delta_a A_{\varsigma,i}^t,$$
where $\alpha_a, \beta_a, \gamma_a, \delta_a$ are weighting coefficients, $A_{\tau,i}^t$ represents temporal anomalies, $A_{\sigma,i}^t$ captures statistical anomalies, $A_{\gamma,i}^t$ indicates geometric median deviations, and $A_{\varsigma,i}^t$ quantifies spectral anomalies.
The security metrics vector $\mathbf{S}_t \in \mathbb{R}^{d_s}$ tracks system-wide security indicators:
$$\mathbf{S}_t = [\xi_t^g, \xi_t^h, \xi_t^c, \xi_t^z, \xi_t^\theta, \xi_t^r],$$
where $\xi_t^g$ measures global model integrity, $\xi_t^h$ quantifies information entropy, $\xi_t^c$ captures aggregation consensus, $\xi_t^z$ indicates system stability, $\xi_t^\theta$ represents threat level assessment, and $\xi_t^r$ measures defense resilience.

3.4.2. Defense Action Space Formulation

The defense action space $\mathcal{A}_d$ encompasses coordinated defense decisions, resource allocation, and temporal control through a hierarchical structure:
$$\mathcal{A}_d = \mathcal{A}_\delta \times \mathcal{A}_\mu \times \mathcal{A}_\alpha \times \mathcal{A}_\zeta,$$
where $\mathcal{A}_\delta$ represents detection actions, $\mathcal{A}_\mu$ determines mitigation strategies, $\mathcal{A}_\alpha$ controls resource allocation, and $\mathcal{A}_\zeta$ manages temporal adaptations.
The detection action space assigns one of six detection actions to each client:
$$\mathcal{A}_\delta = \{a_m, a_i, a_v, a_p, a_q, a_e\}^N,$$
where $a_m$ denotes passive monitoring, $a_i$ represents detailed inspection, $a_v$ indicates verification protocols, $a_p$ triggers active probing, $a_q$ implements temporary isolation, and $a_e$ enforces permanent exclusion.
The mitigation action space operates on multiple defense strategies with coordination constraints:
$$\mathcal{A}_\mu = \left\{\mu \in [0,1]^{N \times M} : \sum_{j=1}^{M} \mu_{i,j} = 1 \;\forall i,\; \sum_{i=1}^{N} \mu_{i,j} \le C_j \;\forall j\right\},$$
$$C_\rho = \max_{i,j,k,l} |\mu_{i,j} - \mu_{k,l}| \le \zeta_\rho,$$
where $\mu_{i,j}$ represents the intensity of mitigation strategy $j$ applied to client $i$; $M$ is the number of mitigation strategies, namely enhanced monitoring ($j = 1$), gradient clipping ($j = 2$), noise injection ($j = 3$), weight decay regularization ($j = 4$), and adaptive learning rate scaling ($j = 5$); $C_j$ bounds the total capacity for strategy $j$; $C_\rho$ ensures coordination constraints; and $\zeta_\rho$ limits mitigation disparity.
The resource allocation space manages computational and communication resources across defense mechanisms:
$$\mathcal{A}_\alpha = \left\{\rho \in [0,1]^{D \times R} : \sum_{r=1}^{R} \rho_{d,r} = 1 \;\forall d,\; \rho_{d,r} \ge \rho_{\min} \;\forall d, r\right\},$$
$$E_\eta = \sum_{d=1}^{D} \sum_{r=1}^{R} \rho_{d,r} F_{d,r}(\mathbf{S}_t) \ge E_{\min},$$
where $\rho_{d,r}$ represents the fraction of resource type $r$ allocated to defense mechanism $d$; $D$ is the number of defense mechanisms, namely temporal analysis ($d = 1$), statistical aggregation ($d = 2$), and validation monitoring ($d = 3$); $R$ is the number of resource types, namely CPU computation ($r = 1$), memory storage ($r = 2$), and communication bandwidth ($r = 3$); $\rho_{\min}$ ensures minimum allocation; $F_{d,r}$ quantifies efficiency functions; and $E_{\min}$ guarantees minimum defense effectiveness.
The temporal adaptation action space $\mathcal{A}_\zeta$ controls dynamic client participation and weight acceptance decisions:
$$\mathcal{A}_\zeta = \left\{\zeta \in \{0,1\}^N : \sum_{i=1}^{N} \zeta_i \ge N_{\min}\right\},$$
where $\zeta_i \in \{0,1\}$ indicates whether to accept model weights from client $c_i$ at the current round (1 for accept, 0 for reject), and $N_{\min}$ ensures minimum participation for convergence. The temporal adaptation decision for each client is governed by:
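$$\zeta_i^t = \begin{cases} 0 & \text{if } A_i^t = \text{True} \land P_i^t > \tau_\zeta, \\ 0 & \text{if } c_i \notin \mathcal{S}^t \land R_i^t > \tau_\rho, \\ \text{Bernoulli}(p_i^t) & \text{if } A_i^t = \text{True} \land P_i^t \le \tau_\zeta, \\ 1 & \text{otherwise}, \end{cases}$$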
where $\tau_\zeta$ and $\tau_\rho$ are rejection thresholds for anomaly scores and reputation scores respectively, and the stochastic acceptance probability is:
$$p_i^t = \sigma\left(\alpha_\zeta - \beta_\zeta A_i^t - \gamma_\zeta R_i^t - \delta_\zeta \|\theta_i^t - \hat{\theta}_\nu^t\|_F\right),$$
where $\sigma(\cdot)$ is the sigmoid function, and $\alpha_\zeta, \beta_\zeta, \gamma_\zeta, \delta_\zeta$ are learned parameters that balance acceptance probability based on anomaly levels, reputation, and parameter deviation.
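A minimal Python sketch of the per-client acceptance rule follows. It is an illustration under stated assumptions, not the paper's code: the function and argument names are hypothetical, the coefficient defaults are arbitrary placeholders for the learned parameters, the anomaly, reputation, and deviation terms are assumed to enter the sigmoid with negative sign, and the outlier-test outcome is passed in as a precomputed boolean.

```python
import math
import random

def accept_probability(anomaly, reputation, deviation,
                       alpha=2.0, beta=1.0, gamma=1.0, delta=1.0):
    """Sigmoid acceptance probability p_i^t; coefficients are placeholders."""
    z = alpha - beta * anomaly - gamma * reputation - delta * deviation
    return 1.0 / (1.0 + math.exp(-z))

def participation(anomalous, pattern_score, failed_outlier_test,
                  reputation, p_accept, tau_zeta, tau_rho, rng):
    """Return 1 to accept the client's weights this round, 0 to reject."""
    if anomalous and pattern_score > tau_zeta:
        return 0  # anomalous with a strong multi-period pattern match
    if failed_outlier_test and reputation > tau_rho:
        return 0  # outlier with a poor reputation history
    if anomalous:
        # Anomalous but weak pattern match: accept stochastically.
        return 1 if rng.random() < p_accept else 0
    return 1

rng = random.Random(0)
p = accept_probability(anomaly=0.8, reputation=0.5, deviation=1.2)
zeta = participation(True, 0.3, False, 0.5, p, tau_zeta=0.6, tau_rho=1.0, rng=rng)
```

The stochastic branch lets borderline clients keep contributing occasionally instead of being hard-rejected, which is one way to keep enough participants for the $N_{\min}$ constraint.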

3.4.3. Defense Transition Dynamics

The state transition probabilities $P_d : \mathcal{S}_d \times \mathcal{A}_d \times \mathcal{S}_d \to [0,1]$ capture the stochastic evolution of the defense environment under coordinated security actions:
$$P_d(s_{t+1}^d \mid s_t^d, a_t^d) = \prod_{j=1}^{|s|} P_j(s_{t+1,j}^d \mid s_t^d, a_t^d),$$
where the factorization assumes conditional independence across state components given the current state and action.
The behavioral profile transition dynamics incorporate temporal evolution and defense interventions:
$$P_B(\mathbf{B}_{t+1} \mid s_t^d, a_t^d) = \prod_{i=1}^{N} \mathcal{N}\left(b_{i,t+1} \mid \mu_{b,i}^t + W_b a_{\delta,i}^t,\; \Sigma_{b,i}^t\right),$$
$$\mu_{b,i}^t = A_b b_i^t + B_b h_i^t + c_b,$$
$$h_i^t = \tanh\left(U_h b_i^{t-1} + V_h a_{\delta,i}^{t-1} + b_h\right),$$
where $A_b, B_b, W_b$ are learned transition matrices, $h_i^t$ represents latent behavioral states, $U_h, V_h$ control temporal dependencies, and $\Sigma_{b,i}^t$ captures uncertainty in behavioral evolution.
The detection history transitions incorporate memory decay and decision outcomes:
$$P_D(\mathbf{D}_{t+1} \mid s_t^d, a_t^d) = \prod_{i=1}^{N} \delta\left(d_{i,t+1} - T_d(d_{i,t}, a_{\delta,i}^t, \omega_t)\right),$$
where $\delta(\cdot)$ is the Dirac delta function, $T_d$ represents the deterministic update function, and $\omega_t$ captures environmental stochasticity.

3.4.4. Defense Reward Function Design

The defense reward function $R_d : \mathcal{S}_d \times \mathcal{A}_d \times \mathcal{S}_d \to \mathbb{R}$ combines multiple defense objectives in a weighted multi-criteria form:
$$R_d(s_t^d, a_t^d, s_{t+1}^d) = \lambda_1 \sum_{i=1}^{N} R_{\delta,i}(s_t^d, a_t^d, s_{t+1}^d) + \lambda_2 \sum_{i=1}^{N} R_{\mu,i}(s_t^d, a_t^d) + \lambda_3 R_\sigma(s_t^d, a_t^d, s_{t+1}^d) + \lambda_4 R_\eta(a_t^d) - \lambda_5 R_\kappa(a_t^d) - \lambda_6 R_\phi(s_t^d, a_t^d, s_{t+1}^d),$$
where the reward components capture detection accuracy, mitigation effectiveness, security improvement, operational efficiency, resource costs, and system disruption.
The detection reward incorporates accuracy, timeliness, and confidence weighting:
$$R_{\delta,i}(s_t^d, a_t^d, s_{t+1}^d) = \mathbb{I}[s_i^* = s_p] \cdot \mathbb{I}[a_{\delta,i} \in \{a_i, a_v, a_p\}] \cdot \eta_\delta \cdot e^{-\lambda_\tau (t - t_{\pi,i})} \cdot (1 - \mathbf{U}_t[i]) \cdot W_\nu^i,$$
where $s_i^*$ represents the true client state, $\eta_\delta$ is the base detection reward, $\lambda_\tau$ controls temporal decay, $t_{\pi,i}$ is the actual attack start time, $\mathbf{U}_t[i]$ quantifies detection uncertainty, and $W_\nu^i$ captures network effect weighting.
The mitigation reward measures the effectiveness of applied defense strategies:
$$R_{\mu,i}(s_t^d, a_t^d) = \sum_{j=1}^{M} \mu_{i,j} \cdot E_j(\mathbf{S}_t, \mathbf{A}_t[i]) \cdot \mathbb{I}[a_{\delta,i} \ne a_m] \cdot \eta_\mu,$$
where $E_j$ quantifies the effectiveness of mitigation strategy $j$ given system state and anomaly levels, and $\eta_\mu$ is the base mitigation reward.
The security reward tracks overall system security improvement:
$$R_\sigma(s_t^d, a_t^d, s_{t+1}^d) = \sum_{k=1}^{d_s} \omega_k^\sigma \max\left(0,\; \mathbf{S}_{t+1}[k] - \mathbf{S}_t[k]\right) + \eta_\sigma I_\theta^t,$$
where $\omega_k^\sigma$ are security metric weights, $\mathbf{S}_t[k]$ represents individual security metrics, $\eta_\sigma$ is the threat mitigation reward, and $I_\theta^t$ indicates successful threat neutralization.
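To illustrate how the detection reward favors early and confident detections, the following sketch evaluates it for a single client. It is a hedged illustration: the function name, the string action labels, and the default values of the base reward, decay rate, and network weight are hypothetical, and the exponent is assumed to be negative so that the reward decays with detection delay.

```python
import math

# Hypothetical labels for the active detection actions a_i, a_v, a_p.
ACTIVE_ACTIONS = {"inspect", "verify", "probe"}

def detection_reward(is_poisoned, action, t, attack_start, uncertainty,
                     eta=1.0, lam=0.1, network_weight=1.0):
    """Per-client detection reward with exponential decay in detection delay."""
    if not is_poisoned or action not in ACTIVE_ACTIONS:
        return 0.0  # no reward for benign clients or passive monitoring
    delay = t - attack_start
    return eta * math.exp(-lam * delay) * (1.0 - uncertainty) * network_weight

early = detection_reward(True, "inspect", t=12, attack_start=10, uncertainty=0.2)
late = detection_reward(True, "inspect", t=30, attack_start=10, uncertainty=0.2)
```

With these placeholder values, acting two rounds after the attack starts earns a larger reward than acting twenty rounds after it.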

3.4.5. Defense Policy Optimization

The optimal defense policy $\pi^* : \mathcal{S}_d \to \mathcal{A}_d$ is learned through reinforcement learning with temporal credit assignment and multi-objective optimization:
$$\pi^* = \arg\max_{\pi} \mathbb{E}_{\tau \sim \pi}\left[\sum_{t=0}^{T-1} \gamma^t R_d(s_t^d, a_t^d, s_{t+1}^d)\right],$$
where $\tau$ represents a trajectory, $\gamma \in (0,1)$ is the discount factor, and the expectation is taken over the policy-induced distribution.
The policy optimization employs an actor-critic architecture with attention mechanisms:
$$\pi_\theta(a_t^d \mid s_t^d) = \operatorname{softmax}(W_\pi h_\pi^t + b_\pi),$$
$$h_\pi^t = \operatorname{Attention}(Q_\pi, K_\pi, V_\pi) + f_\pi(s_t^d),$$
$$V_\phi(s_t^d) = W_V h_V^t + b_V,$$
where $h_\pi^t, h_V^t$ are attention-enhanced hidden representations, $f_\pi$ encodes state features, and $\theta, \phi$ are learnable parameters.
The complete defense framework integrates all components into a single algorithm that executes at each communication round through coordinated multi-layer processing and decision making:
Algorithm 1 DEFEND: DEep Federated Ensemble Network Defense
 1: Input: Client updates $\{\theta_i^t\}_{i=1}^N$, historical profiles $\{b_i^{t-1}\}_{i=1}^N$, defense state $s_{t-1}^d$, hyperparameters $\lambda_\tau, \lambda_\nu, \gamma$, window sizes $\mathcal{W}$, detection thresholds $\{\tau_\alpha^w\}$;
 2: Output: Aggregated model $\theta^t$, updated profiles $\{b_i^t\}_{i=1}^N$, defense action $a_t^d$;
 3: for each client $c_i$ do
 4:     Compute temporal profile $b_i^t$ ▹ Equation (12)
 5:     Calculate statistical moments $\{\mu_i^t, \sigma_i^t, \varsigma_i^t, \kappa_i^t\}$ ▹ Equations (14)-(17)
 6:     Detect temporal anomalies $A_i^t$ ▹ Equation (18)
 7:     Check multi-period patterns $P_i^t$ ▹ Equation (19)
 8: end for
 9: Construct defense state $s_t^d$ ▹ Equation (36)
10: Select defense action $a_t^d \sim \pi_\theta(a_t^d \mid s_t^d)$ ▹ Equation (60)
11: Determine client participation $\{\zeta_i^t\}_{i=1}^N$ ▹ Equation (48)
12: Compute geometric median $\hat{\theta}_\nu^t$ for active clients ▹ Equation (23)
13: Apply Weiszfeld algorithm with updates ▹ Equations (24)-(25)
14: Perform outlier detection to identify $\mathcal{S}^t$ ▹ Equation (26)
15: Calculate client weights $\{w_i^t\}$ for participating clients ▹ Equation (28)
16: Execute weighted aggregation $\theta^t$ ▹ Equation (27)
17: Monitor clean performance $L_c^t$ ▹ Equation (30)
18: Check trigger responses $L_\beta^t$ ▹ Equation (32)
19: Generate alerts $\{A_c^t, A_\beta^t\}$ ▹ Equations (31), (33)
20: if $A_c^t \lor A_\beta^t$ then
21:     Execute graduated response $R^t$ ▹ Equation (34)
22:     Update reputation scores ▹ Equation (29)
23:     if $R^t = R_\theta$ then
24:         Perform model rollback ▹ Equation (35)
25:     end if
26: end if
27: Compute reward $r_t^d = R_d(s_t^d, a_t^d, s_{t+1}^d)$ ▹ Equation (55)
28: Update policy parameters $\theta$ and value function parameters $\phi$ using PPO
29: Store experience tuple $(s_t^d, a_t^d, r_t^d, s_{t+1}^d)$ in replay buffer
30: return $\theta^t$, $\{b_i^t\}_{i=1}^N$, $a_t^d$;
We analyze the computational complexity of Algorithm 1 component by component. Computing behavioral profiles for all clients (lines 4-5) requires $O(N \cdot d_f \cdot w_{\max})$ operations, where $d_f$ is the feature dimension and $w_{\max} = \max(w_f, w_s)$ is the maximum window size. Anomaly detection using Equation (18) across multiple window sizes has complexity $O(N \cdot |\mathcal{W}| \cdot d_b^2)$ due to Mahalanobis distance computations, where $d_b$ is the behavioral profile dimension. The DTW-based pattern matching in Equation (19) requires $O(N \cdot |P_\Theta| \cdot T_P^2)$ operations for $N$ clients and $|P_\Theta|$ pattern templates. State construction according to Equation (36) has complexity $O(N \cdot d_s)$, where $d_s$ is the total state dimension. Policy evaluation using the attention mechanism from Equations (60)-(61) requires $O(d_s^2 + d_a \cdot d_s)$ operations, where $d_a$ is the action space dimension.
The geometric median computation via the Weiszfeld algorithm (lines 12-13) has complexity $O(K_{\text{iter}} \cdot N_{\text{active}} \cdot d)$, where $K_{\text{iter}}$ is the number of iterations, $N_{\text{active}} = \sum_{i=1}^{N} \zeta_i^t$ is the number of participating clients, and $d$ is the model parameter dimension. Outlier detection using Equation (26) requires $O(N_{\text{active}}^2 \cdot d)$ for pairwise distance computations. The weighted aggregation from Equation (27) has complexity $O(N_{\text{active}} \cdot d)$. Clean performance monitoring using Equation (30) requires $O(|D_\upsilon| \cdot d)$ operations for model evaluation. Trigger detection from Equation (32) has complexity $O(|D_\xi| \cdot |Y| \cdot d)$, where $|Y|$ is the number of possible target labels. Model rollback using Equation (35), when triggered, requires $O(K_{\max} \cdot d)$ operations. Reward computation using Equation (55) has complexity $O(N \cdot M)$, where $M$ is the number of mitigation strategies. Policy and value function updates require $O(d_\theta + d_\phi)$ operations, where $d_\theta, d_\phi$ are the parameter dimensions.
The total computational complexity per communication round is:
$$O\left(N \cdot \max\left(N \cdot d,\; |\mathcal{W}| \cdot d_b^2,\; |P_\Theta| \cdot T_P^2\right) + K_{\text{iter}} \cdot N_{\text{active}} \cdot d + |D_\upsilon| \cdot d + |D_\xi| \cdot |Y| \cdot d\right),$$
which scales quadratically with the number of clients in the worst case due to pairwise distance computations, but remains practical for moderate-scale federated deployments with $N \le 100$ clients and an efficient implementation of the geometric median. The temporal adaptation mechanism (line 11) reduces the effective computational load by filtering out suspicious clients early, leading to $N_{\text{active}} < N$ in most scenarios.
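The $O(K_{\text{iter}} \cdot N_{\text{active}} \cdot d)$ cost of the geometric median step is easiest to see in code. The sketch below implements the plain Weiszfeld iteration in pure Python, as an assumption-labelled illustration: it is the textbook fixed-point update, not the accelerated variant with momentum described in the implementation details, and the function name, defaults, and toy data are hypothetical.

```python
import math

def geometric_median(points, max_iter=200, tol=1e-6):
    """Plain Weiszfeld iteration over a list of equal-length vectors."""
    dim = len(points[0])
    # Start from the coordinate-wise mean.
    median = [sum(p[k] for p in points) / len(points) for k in range(dim)]
    for _ in range(max_iter):                      # K_iter iterations
        weights = []
        for p in points:                           # N_active clients
            dist = math.sqrt(sum((p[k] - median[k]) ** 2 for k in range(dim)))
            weights.append(1.0 / max(dist, 1e-12))  # guard against zero distance
        total = sum(weights)
        updated = [sum(w * p[k] for w, p in zip(weights, points)) / total
                   for k in range(dim)]             # d coordinates
        shift = math.sqrt(sum((updated[k] - median[k]) ** 2 for k in range(dim)))
        median = updated
        if shift < tol:
            break
    return median

# Toy data: four nearby "honest" updates and one extreme outlier.
updates = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [100.0, 100.0]]
median = geometric_median(updates)
mean = [sum(u[k] for u in updates) / len(updates) for k in range(2)]
```

On this toy input the coordinate-wise mean is pulled to about 20.4 in each coordinate by the single outlier, while the geometric median stays inside the cluster of the four nearby points.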

4. Experiment

This section presents comprehensive experimental evaluations to validate the effectiveness of our proposed DEFEND framework against temporal backdoor attacks in federated learning environments. We conduct extensive experiments across multiple datasets, network architectures, and attack scenarios to demonstrate the robustness and practical applicability of our multi-layer defense mechanism.

4.1. Experimental Setup

We evaluate our framework on three widely used federated learning benchmarks: CIFAR-10, FEMNIST, and MNIST. These datasets represent diverse application domains with varying data characteristics and complexity levels. To simulate realistic federated environments, we consider both Independent and Identically Distributed (IID) and Non-IID data distributions across clients. For Non-IID scenarios, we employ a Dirichlet distribution with concentration parameters $\alpha \in \{0.1, 0.5, 1.0\}$, where smaller values indicate higher data heterogeneity. The client population varies from 10 to 50 participants to assess scalability across different federation sizes. We evaluate defense performance under various threat models by varying the malicious client ratio $\kappa \in \{0.1, 0.2, 0.3, 0.4\}$, representing scenarios where 10% to 40% of participants are compromised.
We employ two representative deep learning architectures: MobileNet V2 for lightweight mobile applications and ResNet-18 for standard computer vision tasks. The detailed experimental parameters and hyperparameter configurations are summarized in Table 2.
We implement the three temporal backdoor attack strategies described in Section 3.2: fixed-period data poisoning, multi-period data poisoning, and model weight poisoning attacks. For fixed-period attacks, malicious clients inject backdoor triggers during rounds 20-40 with poisoning ratio $\rho = 0.1$. Multi-period attacks distribute poisoning across three disjoint intervals (rounds 10-15, 25-30, and 45-50) while maintaining the same total poisoning budget. Model weight poisoning attacks apply Gaussian perturbations with magnitude $\delta_i^t \in [0.001, 0.01]$ following the decay schedule in Equation (11). Trigger patterns consist of 3×3 pixel patches with intensity variations, and target labels are randomly selected from classes not present in the victim's local data distribution.
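A common way to realize the Dirichlet-based Non-IID split is to draw, for each class, a proportion vector over clients and cut that class's samples accordingly. The sketch below is one such procedure under stated assumptions: the paper does not specify its partitioning code, the function name and seed handling are hypothetical, and Dirichlet samples are obtained by normalizing independent Gamma draws from the standard library.

```python
import random

def dirichlet_partition(labels, num_clients, alpha, seed=0):
    """Split sample indices across clients with per-class Dirichlet(alpha) proportions."""
    rng = random.Random(seed)
    by_class = {}
    for idx, y in enumerate(labels):
        by_class.setdefault(y, []).append(idx)
    clients = [[] for _ in range(num_clients)]
    for y in sorted(by_class):
        idxs = by_class[y]
        rng.shuffle(idxs)
        # Normalized Gamma(alpha, 1) draws form a Dirichlet(alpha) sample.
        draws = [rng.gammavariate(alpha, 1.0) for _ in range(num_clients)]
        total = sum(draws)
        if total == 0.0:  # extremely unlikely underflow; fall back to uniform
            draws, total = [1.0] * num_clients, float(num_clients)
        start, cumulative = 0, 0.0
        for c in range(num_clients):
            cumulative += draws[c] / total
            # The last client takes whatever remains so every index is assigned.
            end = len(idxs) if c == num_clients - 1 else round(cumulative * len(idxs))
            clients[c].extend(idxs[start:end])
            start = max(start, end)
    return clients

# Toy labels: 1000 samples over 10 classes, split across 10 clients.
toy_labels = [i % 10 for i in range(1000)]
parts = dirichlet_partition(toy_labels, num_clients=10, alpha=0.1)
```

Smaller `alpha` values concentrate each class on fewer clients, which matches the statement that smaller concentration parameters correspond to higher heterogeneity.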

4.2. Evaluation Metrics

We employ three comprehensive metrics to evaluate the performance of our DEFEND framework from multiple perspectives:

4.2.1. Clean Accuracy under Defense

The clean accuracy under defense measures the model's classification performance on benign samples when the defense framework is actively protecting against malicious clients, using the held-out validation set $D_\upsilon$:
$$A_c^T = \frac{1}{|D_\upsilon|} \sum_{(x,y) \in D_\upsilon} \mathbb{I}[f_{\theta^T}(x) = y],$$
where $\theta^T$ represents the final global model after $T$ communication rounds, and $f_{\theta^T}$ denotes the model's prediction function. Higher values indicate better preservation of normal task performance under defense conditions.

4.2.2. Defense Success Rate

The defense success rate quantifies the effectiveness of our defense framework by measuring the proportion of triggered samples that do not produce the attacker's target label, using the trigger detection dataset $D_\xi$:
$$S_d^T = 1 - \max_{y_\tau} \frac{1}{|D_\xi|} \sum_{(x_\xi, y_\pi) \in D_\xi} \mathbb{I}[f_{\theta^T}(x_\xi) = y_\tau],$$
where $x_\xi$ represents samples with injected triggers, $y_\pi$ are the original labels, and $y_\tau$ ranges over all possible attack target labels. Higher $S_d^T$ values indicate more effective defense against backdoor attacks.
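The defense success rate reduces to a worst-case count over candidate target labels. The following sketch computes it from a list of model predictions on triggered inputs. The function name and the toy predictions are hypothetical, and the inputs are illustrative values rather than experimental results.

```python
from collections import Counter

def defense_success_rate(predictions, candidate_targets):
    """1 minus the highest fraction of triggered samples mapped to any single target."""
    counts = Counter(predictions)
    n = len(predictions)
    worst_case = max(counts.get(y, 0) / n for y in candidate_targets)
    return 1.0 - worst_case

# Toy predictions on ten triggered samples; label 7 is predicted three times.
toy_predictions = [0, 1, 1, 2, 7, 7, 7, 3, 4, 5]
rate = defense_success_rate(toy_predictions, candidate_targets=range(10))
```

Taking the maximum over target labels makes the metric pessimistic: the defense is scored against whichever target label the triggers activate most often.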

4.2.3. Temporal Detection Efficiency

The temporal detection efficiency measures the framework's ability to rapidly and accurately identify malicious clients by considering both detection accuracy and temporal responsiveness:
$$E_d = \frac{1}{|C_p|} \sum_{c_i \in C_p} \frac{\mathbb{I}\left[\exists\, t \in \{t_{a,i}, \ldots, T\} : \hat{s}_i^t = s_p \land s_i^t = s_p\right]}{t_{d,i} - t_{a,i} + 1},$$
where $t_{d,i}$ is the communication round when client $c_i$ is first correctly classified as being in the poisoned state $s_p$, $t_{a,i}$ is the round when client $c_i$ begins malicious behavior, and $C_p$ represents the set of malicious clients. The indicator function ensures only successful detections are counted. Higher $E_d$ values indicate faster and more accurate malicious client identification.
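As a concrete reading of this metric, the sketch below scores each malicious client by the reciprocal of its detection delay plus one, with undetected clients contributing zero. The function name, the dictionary inputs, and the client identifiers are hypothetical illustrations.

```python
def detection_efficiency(attack_start, first_detection):
    """Average of 1 / (t_d - t_a + 1) over malicious clients; 0 if never detected.

    attack_start: dict client_id -> round t_a when the attack begins.
    first_detection: dict client_id -> round t_d of first correct detection
                     (clients that were never detected are simply absent).
    """
    total = 0.0
    for client_id, t_a in attack_start.items():
        t_d = first_detection.get(client_id)
        if t_d is not None and t_d >= t_a:
            total += 1.0 / (t_d - t_a + 1)
    return total / len(attack_start)

# Toy case: one immediate detection, one after three rounds, one missed.
starts = {"c1": 10, "c2": 10, "c3": 20}
detections = {"c1": 10, "c2": 13}
efficiency = detection_efficiency(starts, detections)
```

A client detected in the same round the attack starts contributes 1, so $E_d = 1$ only if every malicious client is caught immediately.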

4.3. Implementation Details

Our DEFEND framework is implemented using PyTorch 2.1.0 and Python 3.9. All experiments are conducted on NVIDIA A100 GPUs with 40 GB memory. The MDP-based defense coordination employs Proximal Policy Optimization (PPO) for policy learning with clip ratio 0.2 and entropy coefficient 0.01. The geometric median computation uses the accelerated Weiszfeld algorithm with momentum coefficient $\alpha_k = 0.9$ and convergence tolerance $\epsilon = 10^{-6}$.
For behavioral profile construction, we extract spectral features using the Fast Fourier Transform (FFT) with window size 32, and wavelet coefficients using Daubechies-4 wavelets with 3 decomposition levels. The pattern matching employs Dynamic Time Warping with warping constraint $W_{\text{warp}} = 5$ and a Euclidean local distance function.
The validation set $D_\upsilon$ comprises 10% of the total training data, randomly sampled and held out from all clients. The trigger detection dataset $D_\xi$ contains 1000 samples per class with systematically generated trigger patterns covering various spatial positions and intensities.
Each experimental configuration is repeated 5 times with different random seeds, and results are reported with 95% confidence intervals. Statistical significance is assessed using paired t-tests with Bonferroni correction for multiple comparisons.
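For five repetitions, a 95% confidence interval around the mean can be obtained from the Student-t distribution with four degrees of freedom. The sketch below shows one way to compute it. The paper does not state which interval construction it uses, so the t-based form is an assumption, and the per-seed values are made-up placeholders, not reported results.

```python
import math
import statistics

# Two-sided 95% Student-t critical value for 4 degrees of freedom (n = 5 runs).
T_CRIT_95_DF4 = 2.776

def mean_with_ci(values, t_crit=T_CRIT_95_DF4):
    """Return (mean, half_width) of a t-based confidence interval."""
    mean = statistics.mean(values)
    half_width = t_crit * statistics.stdev(values) / math.sqrt(len(values))
    return mean, half_width

# Placeholder metric values from five hypothetical seeds.
runs = [0.90, 0.92, 0.91, 0.93, 0.89]
mean, half_width = mean_with_ci(runs)
```

The hard-coded critical value is only valid for exactly five runs. A different number of seeds would need the matching critical value.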

5. Results

5.1. MDP Policy Learning Performance

Figure 2 demonstrates the training performance of different reinforcement learning algorithms used in our MDP-based defense coordination framework. We compare three algorithms: Proximal Policy Optimization (PPO), Soft Actor-Critic (SAC), and Genetic Algorithm (GA) across different network architectures and data heterogeneity levels.
The experimental results reveal several important insights about the effectiveness of different reinforcement learning approaches for defense policy optimization. PPO consistently achieves the highest cumulative rewards across all configurations, demonstrating superior learning efficiency and stability in the complex multi-objective defense environment. The algorithm shows rapid convergence within the first 150 episodes and maintains stable performance thereafter, reaching peak rewards around 3800-4000 across different settings.
SAC exhibits competitive performance with gradual learning progression, ultimately achieving comparable final rewards to PPO but requiring more episodes for convergence. The continuous learning curve suggests that SAC benefits from its off-policy nature and entropy regularization, particularly evident in scenarios with higher data heterogeneity (lower α values). The algorithm demonstrates consistent improvement throughout training, reaching final rewards between 3200-3600.
GA shows relatively stable but lower performance compared to the other two approaches, with rewards plateauing around 2000-2800 across all configurations. While GA provides baseline performance guarantees and avoids local optima through population-based exploration, it lacks the sophisticated gradient-based optimization capabilities needed for complex sequential decision-making in the defense scenario.
The impact of data heterogeneity (controlled by $\alpha$) appears more pronounced in ResNet-18 configurations compared to MobileNet V2, where PPO maintains consistently high performance regardless of the heterogeneity level. This suggests that the lightweight MobileNet V2 architecture provides more robust defense policy learning under varying data distribution conditions, while ResNet-18 benefits from the additional model capacity when dealing with homogeneous data distributions ($\alpha = 1.0$).

5.2. Clean Accuracy Preservation Performance

Figure 3 illustrates the evolution of clean accuracy under defense across different reinforcement learning algorithms, network architectures, and data heterogeneity settings. This metric evaluates how well each algorithm maintains model utility for legitimate tasks while defending against temporal backdoor attacks.
The results demonstrate significant differences in how each reinforcement learning algorithm balances security and utility preservation. PPO consistently achieves superior clean accuracy performance, reaching and maintaining accuracy levels between 0.85 and 0.95 across most configurations after approximately 200 training episodes. The algorithm shows remarkable stability in preserving model utility while implementing defense mechanisms, with particularly strong performance in homogeneous data distributions ($\alpha = 1.0$).
SAC exhibits competitive clean accuracy preservation with gradual but steady improvement throughout training. The algorithm demonstrates robust performance across heterogeneous data settings, achieving final accuracy values between 0.82-0.92. SAC’s continuous learning approach proves particularly effective in MobileNet V2 configurations, where it matches or occasionally exceeds PPO’s performance, suggesting that the off-policy learning strategy works well with lightweight architectures.
GA shows the most variable performance with significant fluctuations throughout training episodes. While GA occasionally achieves high accuracy spikes (up to 0.95 in some configurations), it struggles to maintain consistent performance, with accuracy frequently dropping to 0.4-0.6 range. This instability indicates that population-based optimization may be less suitable for maintaining the delicate balance between defense effectiveness and model utility preservation in federated learning environments.
The impact of data heterogeneity reveals interesting patterns across architectures. ResNet-18 shows greater sensitivity to heterogeneity levels, with more pronounced performance differences between α = 0.1 and α = 1.0 configurations. In contrast, MobileNet V2 demonstrates more robust performance across different heterogeneity levels, particularly with PPO and SAC algorithms, suggesting that lightweight architectures may provide inherent advantages for federated defense scenarios.
Notably, the clean accuracy preservation performance correlates with the reward optimization patterns observed in Figure 2, confirming that higher cumulative rewards in the MDP framework translate to better utility preservation during defense operations. This validates our multi-objective reward formulation that balances security improvements with model performance maintenance.

5.3. Defense Effectiveness Evaluation

We evaluate the defense effectiveness of our DEFEND framework using two key metrics: Defense Success Rate ($S_d^T$) and Temporal Detection Efficiency ($E_d$) across various system configurations. The following tables present results under different combinations of client population sizes, data heterogeneity levels, malicious client ratios, and network architectures.
Table 3 and Table 4 demonstrate that ResNet-18 exhibits strong sensitivity to data heterogeneity levels under a fixed malicious client ratio ($\kappa = 0.2$). The Defense Success Rate improves substantially from highly heterogeneous ($\alpha = 0.1$) to homogeneous ($\alpha = 1.0$) distributions, with improvements between 8.8% and 9.1% across different client population sizes. The Temporal Detection Efficiency shows even more pronounced improvements of 13.3% to 14.1%, indicating that homogeneous data distributions significantly enhance the temporal behavioral analysis layer's ability to identify malicious patterns. The consistent performance gains with increased client population size suggest that ResNet-18 benefits from larger federation scales for improved defense coordination.
Table 5 and Table 6 reveal the critical impact of malicious client ratios on ResNet-18 defense performance under moderate heterogeneity ($\alpha = 0.5$). The Defense Success Rate shows a clear linear degradation as malicious ratios increase, with performance dropping from 0.972 to 0.803 (17.4% decrease) at the largest federation size. More concerning is the dramatic decline in Temporal Detection Efficiency, which drops from 0.864 to 0.554 (35.9% decrease), approaching the theoretical limits of Byzantine fault tolerance. This indicates that while our framework maintains reasonable attack prevention capabilities even at high malicious ratios, the speed and accuracy of malicious client identification become significantly compromised beyond $\kappa = 0.3$.
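The quoted percentages are consistent with relative decreases measured against the starting value, which can be checked directly from the two pairs of endpoints reported above:
$$\frac{0.972 - 0.803}{0.972} \approx 0.174, \qquad \frac{0.864 - 0.554}{0.864} \approx 0.359.$$
In absolute terms these correspond to drops of 0.169 in Defense Success Rate and 0.310 in Temporal Detection Efficiency.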
Table 7 and Table 8 show that MobileNet V2 maintains competitive defense performance despite its lightweight architecture, achieving a 10.6% improvement in Defense Success Rate and a 14.3% improvement in Temporal Detection Efficiency from heterogeneous to homogeneous distributions. While the absolute values are slightly lower than ResNet-18, MobileNet V2 demonstrates more consistent relative improvements across different client population sizes, with smaller confidence intervals indicating greater stability. The architecture shows particular resilience in heterogeneous environments, making it well-suited for resource-constrained federated deployments where data distribution control is limited.
Table 9 and Table 10 demonstrate that MobileNet V2 exhibits similar vulnerability patterns to ResNet-18 under increasing malicious ratios, but with notably more stable degradation characteristics. The Defense Success Rate decreases by 18.8% from $\kappa = 0.1$ to $\kappa = 0.4$, while Temporal Detection Efficiency drops by 37.1%, comparable to ResNet-18's performance degradation. However, MobileNet V2 shows consistently smaller confidence intervals across all configurations, indicating more predictable and stable defense behavior. This stability advantage becomes particularly valuable in dynamic federated environments where malicious client ratios may fluctuate over time, providing more reliable defense guarantees compared to the higher-capacity but more variable ResNet-18 architecture.

6. Conclusions

This paper presents DEFEND, a comprehensive multi-layer defense framework that counters temporal backdoor attacks in federated learning through sophisticated mathematical modeling and reinforcement learning techniques. Our primary contributions include three distinct temporal attack models, a three-tier defense architecture combining behavioral analysis with robust aggregation, and a novel MDP-based approach for adaptive defense coordination. Experimental evaluation demonstrates strong performance with Defense Success Rates reaching 0.956 ± 0.010 for ResNet-18 and 0.940 ± 0.012 for MobileNet V2, while maintaining clean accuracy levels between 0.85 and 0.95. The framework shows resilience across different data heterogeneity levels and client populations, though performance degrades when malicious ratios exceed 30%.
The framework has limitations including quadratic computational complexity and reduced effectiveness against highly coordinated attacks approaching Byzantine fault tolerance limits. Future research should focus on developing more efficient algorithms, adaptive threshold mechanisms, and extensions to other domains beyond computer vision. As federated learning expands into critical applications, robust defense mechanisms like DEFEND become essential for maintaining system integrity and user trust in distributed machine learning paradigms.

Appendix A. Byzantine Robustness Analysis

This section establishes the theoretical foundation for the Byzantine robustness of our DEFEND framework through rigorous mathematical analysis of the geometric median aggregation mechanism and outlier detection procedures.

Appendix A.1. Fundamental Byzantine Robustness Theorem

Theorem A1 (Byzantine Robustness of DEFEND Framework). Consider a federated learning system with $N$ participating clients where at most $f$ clients are Byzantine adversaries satisfying $f < N/2$. Let $C_h$ denote honest clients and $C_p$ denote Byzantine clients with $|C_p| = f$. Under the DEFEND framework with geometric median aggregation (23) and statistical outlier detection (26), the global model parameter $\theta^t$ converges to within an $\epsilon$-neighborhood of the optimal solution $\theta^*$ with high probability.
Specifically, for any $\epsilon > 0$ and confidence parameter $\delta \in (0,1)$, there exists a finite round $T_0$ such that for all $t \ge T_0$:
$$P\left(\|\theta^t - \theta^*\|_F \le \epsilon + 2\xi + O\left(\sqrt{\frac{\log(1/\delta)}{N}}\right)\right) \ge 1 - \delta,$$
where $\xi$ bounds the geometric median estimation error, provided:
1. The Byzantine client fraction satisfies $f \le (N-1)/2$;
2. The geometric median approximation error is bounded: $\|\hat{\theta}_\nu^t - \bar{\theta}_h^t\|_F \le \xi$, where $\bar{\theta}_h^t$ represents the mean of honest client updates;
3. The outlier detection mechanism achieves controlled error rates: false positive rate $\alpha \le 0.05$ and false negative rate $\beta \le 0.1$.
Proof. The proof proceeds through four main steps: (1) establishing the breakdown point properties of the geometric median, (2) analyzing the statistical concentration of honest client updates, (3) bounding the outlier detection accuracy, and (4) proving convergence under Byzantine presence.
Step 1: Breakdown Point Analysis of Geometric Median
The geometric median $\hat{\theta}_\nu^t$ defined in (23) possesses a breakdown point of exactly $1/2$, meaning it can tolerate up to $\lfloor (N-1)/2 \rfloor$ arbitrary outliers without complete failure. This fundamental property ensures robustness against Byzantine adversaries when $f < N/2$.
For the geometric median computation, let $\bar{\theta}_h^t = \frac{1}{N-f} \sum_{i \in C_h} \theta_i^t$ denote the empirical mean of honest client updates. By the robustness property of the geometric median, we have:
$$\|\hat{\theta}_\nu^t - \bar{\theta}_h^t\|_F \le \frac{2f}{N-f} \cdot \max_{i \in C_h} \|\theta_i^t - \bar{\theta}_h^t\|_F.$$
Since honest clients follow the true learning dynamics, their updates concentrate around the optimal direction. Under standard federated learning assumptions with bounded gradient variance, we have:
$$\max_{i \in C_h} \|\theta_i^t - \bar{\theta}_h^t\|_F \le \sigma \sqrt{\frac{2\log(N)}{n_{\min}}},$$
with probability at least $1 - N^{-1}$, where $\sigma$ is the gradient noise parameter and $n_{\min}$ is the minimum local dataset size.
Step 2: Statistical Concentration of Honest Updates
For honest clients $c_i \in C_h$, the local model updates $\theta_i^t$ are generated through standard gradient descent on local data. Under the assumption of sub-Gaussian gradient noise with parameter $\sigma^2$, we can establish concentration bounds.
Let $\nabla F_i(\theta^{t-1})$ denote the true local gradient for client $c_i$. The empirical gradient computed from local data satisfies:
$$\mathbb{P}\left[ \|\nabla \hat{F}_i(\theta^{t-1}) - \nabla F_i(\theta^{t-1})\|_F \ge t \right] \le 2\exp\!\left( -\frac{n_i t^2}{2\sigma^2 d} \right),$$
where $d$ is the model parameter dimension and $n_i$ is the local dataset size for client $c_i$.
The local update deviation from the ideal direction can be bounded by:
$$\|\theta_i^t - \theta_{\mathrm{ideal}}^t\|_F \le \eta \|\nabla \hat{F}_i(\theta^{t-1}) - \nabla F_i(\theta^{t-1})\|_F + \eta L_F \|\theta^{t-1} - \theta^*\|_F ,$$
where $\eta$ is the learning rate and $L_F$ is the Lipschitz constant from the smoothness assumption.
Step 3: Outlier Detection Accuracy Analysis
Our outlier detection mechanism (26) identifies suspicious clients based on their distance from the geometric median. For a client $c_i$, define the detection statistic:
$$d_i^t = \|\theta_i^t - \hat{\theta}_\nu^t\|_F .$$
Under the null hypothesis that client $c_i$ is honest, $d_i^t$ follows a distribution characterized by the concentration properties established in Step 2. The detection threshold $\tau_\alpha$ is set based on quantiles of this null distribution as defined in (26).
For honest clients, the false positive probability is bounded by:
$$\mathbb{P}\left[ c_i \in S^t \mid c_i \in C_h \right] \le \alpha + \exp\!\left( -\frac{n_i (\tau_\alpha - \mathbb{E}[d_i^t])^2}{2\sigma^2 d} \right),$$
where $\alpha$ is the nominal false positive rate.
For Byzantine clients with significantly deviating updates, the detection probability satisfies:
$$\mathbb{P}\left[ c_j \in S^t \mid c_j \in C_p \right] \ge 1 - \beta,$$
provided their update magnitude exceeds the detection threshold by a sufficient margin, where $\beta$ is the false negative rate.
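As an illustration of the detection statistic in Step 3, the following Python sketch flags clients whose distance to a given center exceeds a quantile-based threshold. Since (26) is not reproduced in this appendix, the thresholding rule shown here is only one plausible reading; the helper names, the calibration sample `null_distances`, and all numeric values are hypothetical.

```python
import math

def empirical_quantile(values, q):
    """Smallest sample value with at least a fraction q of the sample at or below it."""
    ordered = sorted(values)
    index = max(0, math.ceil(q * len(ordered)) - 1)
    return ordered[index]

def flag_suspicious(updates, center, tau):
    """Return the indices i whose statistic d_i = ||theta_i - center|| exceeds tau."""
    return {i for i, u in enumerate(updates) if math.dist(u, center) > tau}

# Hypothetical calibration sample: distances observed for clients assumed honest.
null_distances = [0.10, 0.12, 0.08, 0.15, 0.11, 0.09, 0.14, 0.13, 0.10, 0.12]
alpha = 0.05
tau = empirical_quantile(null_distances, 1.0 - alpha)  # (1 - alpha) quantile

center = (1.0, 1.0)  # stands in for the geometric median of the round
updates = [(1.05, 1.0), (0.95, 1.05), (1.0, 0.9), (1.1, 1.05), (3.0, -2.0)]
suspicious = flag_suspicious(updates, center, tau)
```

In this toy round only the last update lies far from the center, so it is the only index placed in the suspicious set.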
Step 4: Convergence Analysis Under Byzantine Presence
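After outlier detection and removal, the weighted aggregation operates on the filtered client set. With probability at least $1 - \delta$, the set $\{c_1, \ldots, c_N\} \setminus S^t$ contains predominantly honest clients.
The weighted aggregation (27) yields:
$$\theta^t = \arg\min_{\theta} \sum_{i \notin S^t} w_i^t \|\theta - \theta_i^t\|_F^2 + \lambda_\tau \|\theta - \theta^{t-1}\|_F^2 + \lambda_\nu \|\theta - \hat{\theta}_\nu^t\|_F^2 .$$
The solution can be expressed as:
$$\theta^t = \frac{\sum_{i \notin S^t} w_i^t \theta_i^t + \lambda_\tau \theta^{t-1} + \lambda_\nu \hat{\theta}_\nu^t}{\sum_{i \notin S^t} w_i^t + \lambda_\tau + \lambda_\nu} .$$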
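The closed form follows from the first-order optimality condition of the quadratic objective. A short derivation is sketched below, reading the sum as running over the clients not flagged in $S^t$ and assuming non-negative weights $w_i^t$ with a strictly positive denominator, so that the objective is strictly convex:

```latex
\begin{aligned}
0 &= \nabla_\theta \Big[ \sum_{i \notin S^t} w_i^t \|\theta - \theta_i^t\|_F^2
      + \lambda_\tau \|\theta - \theta^{t-1}\|_F^2
      + \lambda_\nu \|\theta - \hat{\theta}_\nu^t\|_F^2 \Big] \\
  &= 2 \sum_{i \notin S^t} w_i^t (\theta - \theta_i^t)
      + 2 \lambda_\tau (\theta - \theta^{t-1})
      + 2 \lambda_\nu (\theta - \hat{\theta}_\nu^t) \\
\Longrightarrow \quad
\Big( \sum_{i \notin S^t} w_i^t + \lambda_\tau + \lambda_\nu \Big)\, \theta
  &= \sum_{i \notin S^t} w_i^t \theta_i^t + \lambda_\tau \theta^{t-1} + \lambda_\nu \hat{\theta}_\nu^t .
\end{aligned}
```

Dividing both sides by the scalar on the left gives the weighted average stated above.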
Since the filtered set predominantly contains honest clients, we can decompose the aggregation error as:
$$\|\theta^t - \theta^*\|_F \le \|\theta^t - \mathbb{E}[\theta^t]\|_F + \|\mathbb{E}[\theta^t] - \theta^*\|_F .$$
The concentration term is bounded using the Azuma-Hoeffding inequality for weighted sums:
$$\|\theta^t - \mathbb{E}[\theta^t]\|_F \le O\!\left( \sqrt{\frac{\log(1/\delta)}{|\{c_1, \ldots, c_N\} \setminus S^t|}} \right),$$
with probability at least $1 - \delta$.
The bias term is controlled by the geometric median approximation error and temporal regularization:
$$\|\mathbb{E}[\theta^t] - \theta^*\|_F \le 2\xi + \frac{\lambda_\tau}{\lambda_\nu} \|\theta^{t-1} - \theta^*\|_F . \tag{A13}$$
The ratio $\lambda_\tau / \lambda_\nu < 1$ ensures contraction, leading to convergence as $t \to \infty$.
Combining the concentration and bias bounds establishes (A1), completing the proof. □
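To make the contraction argument concrete, the sketch below iterates the scalar worst case of the bias bound, $e_t = 2\xi + r\, e_{t-1}$ with $r = \lambda_\tau / \lambda_\nu$, treating the expectation as a deterministic surrogate. This is a simplification made for illustration only, and the values of $r$, $\xi$ and the initial error are hypothetical.

```python
def unroll_bias_bound(e0, r, xi, rounds):
    """Iterate e_t = 2*xi + r*e_{t-1}, the scalar worst case of the bias recursion."""
    errors = [e0]
    for _ in range(rounds):
        errors.append(2.0 * xi + r * errors[-1])
    return errors

r, xi = 0.5, 0.01                      # hypothetical ratio lambda_tau / lambda_nu and median error
errors = unroll_bias_bound(e0=10.0, r=r, xi=xi, rounds=60)
floor = 2.0 * xi / (1.0 - r)           # fixed point of the scalar recursion
```

In this simplified recursion the error decreases geometrically from its initial value and settles at $2\xi / (1 - r)$, which stays close to $2\xi$ when the ratio $r$ is small.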

Appendix A.2. Corollaries and Extensions

Corollary A1 
(Finite-Sample Convergence Rate). Under the conditions of Theorem A1, the DEFEND framework achieves $\epsilon$-convergence in at most
$$T(\epsilon, \delta) = O\!\left( \frac{\log(\|\theta^0 - \theta^*\|_F / \epsilon)}{\log(1 / (1 - \lambda_\tau / \lambda_\nu))} + \log(1/\delta) \right)$$
communication rounds with probability at least $1 - \delta$.
Proof. 
The proof follows from iterating the contraction property in (A13) and using the union bound over all communication rounds. □
Corollary A2 
(Optimality of Byzantine Tolerance). The Byzantine tolerance threshold $f < N/2$ in Theorem A1 is optimal. No algorithm can guarantee convergence when $f \ge N/2$ in the worst case.
Proof. 
This follows from the fundamental impossibility results in Byzantine fault tolerance. When $f \ge N/2$, Byzantine clients can coordinate to completely overwhelm honest clients, making any aggregation rule vulnerable to manipulation. □

Appendix A.3. Robustness Under Stronger Attack Models

Lemma A1 
(Robustness Against Coordinated Attacks). The DEFEND framework maintains its robustness guarantees even when Byzantine clients coordinate their attacks, provided they cannot observe the geometric median computation in real-time.
Proof. 
Coordinated attacks can increase the correlation between Byzantine updates but cannot change the fundamental breakdown point of the geometric median. The proof follows the same structure as Theorem A1 with modified concentration bounds for correlated adversarial behavior. □

Appendix B. Weiszfeld Algorithm Convergence Analysis

This section provides a comprehensive theoretical analysis of the convergence properties of the accelerated Weiszfeld algorithm used for geometric median computation in our DEFEND framework.

Appendix B.1. Preliminaries and Algorithm Description

The Weiszfeld algorithm solves the geometric median problem defined in (23):
$$\hat{\theta}_\nu^t = \arg\min_{\theta} \sum_{i=1}^{N} \|\theta - \theta_i^t\|_F .$$
Our accelerated version incorporates momentum-based acceleration as defined in (24) and (25). Let $\{\theta_1^t, \ldots, \theta_N^t\}$ denote the set of client updates at communication round $t$, and assume they are distinct with probability 1.
Definition A1 
(Weiszfeld Iteration Operator). The Weiszfeld iteration operator $T: \mathbb{R}^d \to \mathbb{R}^d$ is defined as:
$$T(\theta) = \frac{\sum_{i=1}^{N} \dfrac{\theta_i^t}{\|\theta - \theta_i^t\|_F + \epsilon}}{\sum_{i=1}^{N} \dfrac{1}{\|\theta - \theta_i^t\|_F + \epsilon}}, \tag{A16}$$
where $\epsilon > 0$ is the regularization parameter to avoid division by zero.
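A minimal sketch of the operator in Definition A1 is given below, using only the Python standard library. The starting point (the coordinate-wise mean), the stopping rule, and the sample updates are assumptions made for illustration; the acceleration of (24) and (25) is not included.

```python
import math

def weiszfeld_step(theta, points, eps=1e-6):
    """One application of the regularized Weiszfeld operator T(theta)."""
    # Each client update is weighted by 1 / (||theta - theta_i|| + eps).
    weights = [1.0 / (math.dist(theta, p) + eps) for p in points]
    total = sum(weights)
    return [sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(len(theta))]

def geometric_median(points, eps=1e-6, tol=1e-9, max_iter=1000):
    """Iterate T from the coordinate-wise mean until the step falls below tol."""
    dim = len(points[0])
    theta = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for _ in range(max_iter):
        new_theta = weiszfeld_step(theta, points, eps)
        if math.dist(new_theta, theta) < tol:
            return new_theta
        theta = new_theta
    return theta

# Four honest updates near (1, 1) and one Byzantine update far away.
updates = [(1.0, 1.0), (1.1, 0.9), (0.9, 1.1), (1.0, 1.2), (100.0, -100.0)]
median = geometric_median(updates)
mean = [sum(p[j] for p in updates) / len(updates) for j in range(2)]
```

On this toy input the geometric median stays inside the honest cluster, while the arithmetic mean is dragged far toward the single outlier.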

Appendix B.2. Main Convergence Theorem

Theorem A2 
(Convergence of Accelerated Weiszfeld Algorithm). Consider the accelerated Weiszfeld algorithm defined by (24) and (25) applied to compute the geometric median of client updates $\{\theta_1^t, \ldots, \theta_N^t\} \subset \mathbb{R}^d$. Let $\hat{\theta}_\nu^{t*}$ denote the unique geometric median. Then:
1. Linear Convergence: The algorithm converges linearly to $\hat{\theta}_\nu^{t*}$ with rate $\rho \in (0, 1)$, where
$$\|\theta^{(k+1)} - \hat{\theta}_\nu^{t*}\|_F \le \rho\, \|\theta^{(k)} - \hat{\theta}_\nu^{t*}\|_F ;$$
2. Iteration Complexity: The number of iterations required to achieve $\epsilon$-accuracy is
$$K(\epsilon) = O\!\left( \sqrt{\kappa}\, \log \frac{\|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F}{\epsilon} \right),$$
where $\kappa$ is the condition number of the problem;
3. Acceleration Benefit: The momentum acceleration reduces the convergence constant by a factor of $(1 - \sqrt{\mu/L})$, where $\mu$ and $L$ are the strong convexity and Lipschitz parameters.
Proof. 
The proof is structured in four main parts: (1) establishing strong convexity of the geometric median objective, (2) proving contraction properties of the Weiszfeld operator, (3) analyzing momentum acceleration, and (4) deriving iteration complexity bounds.
Part 1: Strong Convexity Analysis
The geometric median objective function is:
$$f(\theta) = \sum_{i=1}^{N} \|\theta - \theta_i^t\|_F .$$
For $\theta \neq \theta_i^t$ for all $i$ (which occurs with probability 1), the function $f$ is twice differentiable. The Hessian matrix is:
$$\nabla^2 f(\theta) = \sum_{i=1}^{N} \frac{1}{\|\theta - \theta_i^t\|_F} \left( I - \frac{(\theta - \theta_i^t)(\theta - \theta_i^t)^T}{\|\theta - \theta_i^t\|_F^2} \right).$$
Lemma A2 (Strong Convexity Parameter). 
The objective function $f(\theta)$ is strongly convex with parameter
$$\mu = \min_{i = 1, \ldots, N} \frac{1}{\|\theta - \theta_i^t\|_F + \epsilon} \ge \frac{1}{D_{\max} + \epsilon},$$
where $D_{\max} = \max_{i,j} \|\theta_i^t - \theta_j^t\|_F$ is the diameter of the client update set.
Proof of Lemma A2. 
Each term in the Hessian corresponds to a projection onto the orthogonal complement of $(\theta - \theta_i^t)$. The minimum eigenvalue is achieved when $\theta$ is furthest from all client updates, giving the stated bound. □
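The Hessian in Part 1 can be evaluated directly in low dimension. The sketch below computes it for two-dimensional updates and returns its smallest eigenvalue at a given point, which is the local curvature that a strong convexity parameter has to lower-bound. The sample points and the evaluation point are hypothetical, and the regularization $\epsilon$ is omitted.

```python
import math

def hessian_2d(theta, points):
    """Entries (hxx, hxy, hyy) of the Hessian of f(theta) = sum_i ||theta - theta_i||.

    Valid only when theta differs from every point (2-D case).
    """
    hxx = hxy = hyy = 0.0
    for px, py in points:
        dx, dy = theta[0] - px, theta[1] - py
        d = math.hypot(dx, dy)
        ux, uy = dx / d, dy / d
        # Contribution (1/d) * (I - u u^T) of this client update.
        hxx += (1.0 - ux * ux) / d
        hxy += (-ux * uy) / d
        hyy += (1.0 - uy * uy) / d
    return hxx, hxy, hyy

def min_eigenvalue(hxx, hxy, hyy):
    """Smallest eigenvalue of the symmetric matrix [[hxx, hxy], [hxy, hyy]]."""
    half_trace = 0.5 * (hxx + hyy)
    return half_trace - math.hypot(0.5 * (hxx - hyy), hxy)

points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
theta = (1.5, 1.5)
mu_local = min_eigenvalue(*hessian_2d(theta, points))
```

Each summand is a scaled projection, so updates lying on the line through `theta` in a given direction contribute no curvature along that direction; here the curvature along the diagonal comes only from the two off-diagonal points.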
Part 2: Contraction Analysis of Weiszfeld Operator
The Weiszfeld operator can be interpreted as a proximal gradient step. Define the subdifferential of $f$ at $\theta$:
$$\partial f(\theta) = \sum_{i=1}^{N} \frac{\theta - \theta_i^t}{\|\theta - \theta_i^t\|_F + \epsilon} .$$
Lemma A3 (Contraction Property). 
The Weiszfeld operator $T$ is a contraction mapping with contraction factor
$$\rho_{\mathrm{base}} = 1 - \frac{\mu}{L} \le 1 - \frac{1}{L (D_{\max} + \epsilon)},$$
where $L$ is the Lipschitz constant of $\nabla f$.
Proof of Lemma A3. 
The Weiszfeld iteration can be written as:
$$\theta^{(k+1)} = \theta^{(k)} - \alpha^{(k)} \nabla f(\theta^{(k)}),$$
where $\alpha^{(k)} = \left( \sum_{i=1}^{N} \frac{1}{\|\theta^{(k)} - \theta_i^t\|_F + \epsilon} \right)^{-1}$ is an adaptive step size.
By the strong convexity of $f$, we have:
$$f(\theta^{(k+1)}) \le f(\theta^{(k)}) - \frac{\mu}{2} \|\theta^{(k+1)} - \theta^{(k)}\|_F^2 .$$
The optimality condition $\nabla f(\hat{\theta}_\nu^{t*}) = 0$ combined with strong convexity yields:
$$\|\theta^{(k+1)} - \hat{\theta}_\nu^{t*}\|_F^2 \le (1 - \mu/L)\, \|\theta^{(k)} - \hat{\theta}_\nu^{t*}\|_F^2 .$$
Part 3: Momentum Acceleration Analysis
The momentum-accelerated version from (25) follows the heavy ball method:
$$\theta^{(k+1)} \leftarrow \theta^{(k+1)} + \alpha_k (\theta^{(k+1)} - \theta^{(k)}),$$
where $\alpha_k$ is the momentum coefficient.
Lemma A4 (Acceleration Benefit). 
With optimally chosen momentum parameter
$$\alpha_k = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}},$$
the accelerated convergence rate becomes
$$\rho_{\mathrm{acc}} = \frac{\sqrt{L} - \sqrt{\mu}}{\sqrt{L} + \sqrt{\mu}} = \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1},$$
where $\kappa = L/\mu$ is the condition number.
Proof of Lemma A4. 
The momentum method can be analyzed using the potential function approach. Define:
$$\Phi^{(k)} = f(\theta^{(k)}) - f(\hat{\theta}_\nu^{t*}) + \frac{\beta}{2} \|\theta^{(k)} - \theta^{(k-1)}\|_F^2 ,$$
for an appropriate constant $\beta > 0$.
The momentum parameter choice ensures:
$$\Phi^{(k+1)} \le \left( \frac{\sqrt{\kappa} - 1}{\sqrt{\kappa} + 1} \right)^{2} \Phi^{(k)} .$$
This implies the stated convergence rate for the function values, which translates to the parameter convergence rate through strong convexity. □
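The extrapolation step of Part 3 can be sketched as follows: each iteration applies the Weiszfeld operator and then moves further along the latest displacement. Because (24) and (25) are not reproduced in this appendix, the pairing of the operator with the extrapolation, the fixed momentum value, the stopping rule, and the sample points are all assumptions made for illustration.

```python
import math

def weiszfeld_step(theta, points, eps=1e-6):
    """Regularized Weiszfeld operator T(theta): inverse-distance weighted average."""
    weights = [1.0 / (math.dist(theta, p) + eps) for p in points]
    total = sum(weights)
    return [sum(w * p[j] for w, p in zip(weights, points)) / total
            for j in range(len(theta))]

def accelerated_weiszfeld(points, momentum, eps=1e-6, tol=1e-10, max_iter=500):
    """Apply T, then extrapolate along the latest displacement.

    The constant momentum is an illustrative assumption, not the paper's
    schedule; momentum=0.0 recovers the plain Weiszfeld iteration.
    Returns the final iterate and the number of iterations used.
    """
    dim = len(points[0])
    theta = [sum(p[j] for p in points) / len(points) for j in range(dim)]
    for k in range(1, max_iter + 1):
        base = weiszfeld_step(theta, points, eps)
        new_theta = [b + momentum * (b - t) for b, t in zip(base, theta)]
        if math.dist(new_theta, theta) < tol:
            return new_theta, k
        theta = new_theta
    return theta, max_iter

points = [(0.0, 0.0), (2.0, 0.0), (0.0, 2.0), (2.0, 2.0), (50.0, 50.0)]
plain, plain_iters = accelerated_weiszfeld(points, momentum=0.0)
fast, fast_iters = accelerated_weiszfeld(points, momentum=0.3)
```

Both runs should reach the same point, which for this symmetric example lies on the diagonal at $1 + 1/\sqrt{3}$ in each coordinate; comparing `plain_iters` with `fast_iters` gives a rough feel for the effect of the extrapolation on a given input.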
Part 4: Iteration Complexity Derivation
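To achieve $\epsilon$-accuracy, $\|\theta^{(k)} - \hat{\theta}_\nu^{t*}\|_F \le \epsilon$, we need:
$$\rho_{\mathrm{acc}}^{k}\, \|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F \le \epsilon .$$
Taking logarithms:
$$k \ge \frac{\log(\|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F / \epsilon)}{\log(1/\rho_{\mathrm{acc}})} .$$
Using the approximation $\log(1/\rho_{\mathrm{acc}}) \approx 2/\sqrt{\kappa}$ for large $\kappa$:
$$k = O\!\left( \sqrt{\kappa}\, \log \frac{\|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F}{\epsilon} \right).$$
This completes the proof of all three statements in Theorem A2. □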
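The iteration count in Part 4 can be evaluated numerically for a given contraction rate. The sketch below takes the rate as an input; the example value assumes the reading $\rho_{\mathrm{acc}} = (\sqrt{\kappa} - 1)/(\sqrt{\kappa} + 1)$ with a hypothetical $\sqrt{\kappa} = 10$, a unit initial error, and a target accuracy of $10^{-6}$.

```python
import math

def iterations_for_accuracy(rho, initial_error, eps):
    """Smallest integer k with rho**k * initial_error <= eps, for 0 < rho < 1."""
    if not 0.0 < rho < 1.0:
        raise ValueError("rho must lie strictly between 0 and 1")
    if initial_error <= eps:
        return 0
    return math.ceil(math.log(initial_error / eps) / math.log(1.0 / rho))

sqrt_kappa = 10.0                                    # hypothetical value
rho_acc = (sqrt_kappa - 1.0) / (sqrt_kappa + 1.0)    # assumed reading of the accelerated rate
k = iterations_for_accuracy(rho_acc, initial_error=1.0, eps=1e-6)
approx = 0.5 * sqrt_kappa * math.log(1.0 / 1e-6)     # large-kappa approximation
```

For these inputs the exact count is 69 iterations, and the large-$\kappa$ approximation gives about 69.1, so the two agree closely even at this moderate condition number.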

Appendix B.3. Practical Implementation Considerations

Lemma A5 
(Numerical Stability). The regularization parameter $\epsilon$ in Definition A1 can be chosen as $\epsilon = O(10^{-6})$ without significantly affecting the convergence rate, provided the condition number $\kappa$ is bounded.
Proof. 
The regularization affects the strong convexity parameter by at most $\epsilon / D_{\max}$, which is negligible for practical choices of $\epsilon$ relative to the client update diameter $D_{\max}$. □
Corollary A3 
(Distributed Implementation). The Weiszfeld algorithm can be implemented in a distributed manner where each iteration requires only $O(N)$ communication complexity to exchange the current iterate and compute the weighted average.
Proof. 
Each iteration of the Weiszfeld algorithm requires computing the weighted average in (A16), which can be done with a single round of communication where each client sends its current update $\theta_i^t$ and receives the current iterate $\theta^{(k)}$. □

Appendix B.4. Robustness Under Approximate Computation

Theorem A3 
(Robustness to Computation Errors). Suppose each Weiszfeld iteration is computed with additive error $e^{(k)}$ satisfying $\|e^{(k)}\|_F \le \delta$ for some $\delta > 0$. Then the algorithm still converges to within $O(\delta/\mu)$ of the true geometric median $\hat{\theta}_\nu^{t*}$.
Proof. 
The proof follows by modifying the contraction analysis in Lemma A3 to account for the additional error terms in each iteration. The modified iteration becomes:
$$\theta^{(k+1)} = T(\theta^{(k)}) + e^{(k)} .$$
The contraction property still holds with an additional bias term:
$$\|\theta^{(k+1)} - \hat{\theta}_\nu^{t*}\|_F \le \rho\, \|\theta^{(k)} - \hat{\theta}_\nu^{t*}\|_F + \delta .$$
Taking the limit as $k \to \infty$ shows that the algorithm converges to within $\delta/(1 - \rho) = O(\delta/\mu)$ of the true geometric median. □
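The limiting step can be made explicit by unrolling the perturbed contraction over $k$ iterations and summing the resulting geometric series, using only the quantities defined in Theorem A3:

```latex
\begin{aligned}
\|\theta^{(k)} - \hat{\theta}_\nu^{t*}\|_F
  &\le \rho^{k}\, \|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F + \delta \sum_{j=0}^{k-1} \rho^{j} \\
  &= \rho^{k}\, \|\theta^{(0)} - \hat{\theta}_\nu^{t*}\|_F + \delta\, \frac{1 - \rho^{k}}{1 - \rho}
  \;\xrightarrow[k \to \infty]{}\; \frac{\delta}{1 - \rho} .
\end{aligned}
```

As a purely illustrative example with hypothetical values, $\rho = 0.9$ and $\delta = 10^{-4}$ give an error floor of $10^{-3}$, ten times the per-iteration error.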

Appendix B.5. Integration with DEFEND Framework

Corollary A4 
(Convergence in DEFEND Context). When integrated into the DEFEND framework, the Weiszfeld algorithm for computing $\hat{\theta}_\nu^t$ in (23) achieves the complexity bound from Theorem A2 at each communication round $t$, with the condition number $\kappa$ determined by the spread of honest client updates.
Proof. 
The condition number $\kappa$ in each communication round depends on the ratio $L/\mu$, where $L$ and $\mu$ are determined by the distribution of client updates $\{\theta_1^t, \ldots, \theta_N^t\}$. Under the assumption of bounded client update variance, $\kappa$ remains bounded across communication rounds, ensuring consistent convergence performance. □

References

  1. Wen, J.; Zhang, Z.; Lan, Y.; Cui, Z.; Cai, J.; Zhang, W. A Survey on Federated Learning: Challenges and Applications. International Journal of Machine Learning and Cybernetics 2023, 14, 513–535.
  2. Abadi, A.; Doyle, B.; Gini, F.; Guinamard, K.; Murakonda, S.K.; Liddell, J.; Mellor, P.; Murdoch, S.J.; Naseri, M.; Page, H.; et al. Starlit: Privacy-Preserving Federated Learning to Enhance Financial Fraud Detection. arXiv preprint arXiv:2401.10765 2024.
  3. Wu, B.; Huang, J.; Yu, S. "X of Information" Continuum: A Survey on AI-Driven Multi-Dimensional Metrics for Next-Generation Networked Systems. arXiv preprint arXiv:2507.19657 2025.
  4. Gong, X.; Chen, Y.; Wang, Q.; Kong, W. Backdoor Attacks and Defenses in Federated Learning: State-of-the-Art, Taxonomy, and Future Directions. IEEE Wireless Communications 2023, 30, 114–121.
  5. Wu, B.; Huang, J.; Duan, Q.; Dong, L.; Cai, Z. Enhancing Vehicular Platooning With Wireless Federated Learning: A Resource-Aware Control Framework. arXiv preprint arXiv:2507.00856 2025.
  6. Liu, T.; Zhang, Y.; Feng, Z.; Yang, Z.; Xu, C.; Man, D.; Yang, W. Beyond Traditional Threats: A Persistent Backdoor Attack on Federated Learning. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). AAAI, 2024, Vol. 38, pp. 21359–21367.
  7. Fang, Z.; Wang, J.; Ma, Y.; Tao, Y.; Deng, Y.; Chen, X.; Fang, Y. R-ACP: Real-Time Adaptive Collaborative Perception Leveraging Robust Task-Oriented Communications. IEEE Journal on Selected Areas in Communications 2025.
  8. Ozdayi, M.S.; Kantarcioglu, M.; Gel, Y.R. Defending Against Backdoors in Federated Learning With Robust Learning Rate. In Proceedings of the AAAI Conference on Artificial Intelligence (AAAI). AAAI, 2021, Vol. 35, pp. 9268–9276.
  9. Fang, Z.; Hu, S.; Wang, J.; Deng, Y.; Chen, X.; Fang, Y. Prioritized Information Bottleneck Theoretic Framework With Distributed Online Learning for Edge Video Analytics. IEEE Transactions on Networking 2025, pp. 1–17.
  10. Wu, B.; Cai, Z.; Wu, W.; Yin, X. AoI-Aware Resource Management for Smart Health via Deep Reinforcement Learning. IEEE Access 2023.
  11. Chen, Z.; Yu, S.; Fan, M.; Liu, X.; Deng, R.H. Privacy-Enhancing and Robust Backdoor Defense for Federated Learning on Heterogeneous Data. IEEE Transactions on Information Forensics and Security 2024, 19, 693–707.
  12. Sewak, M.; Sahay, S.K.; Rathore, H. Deep Reinforcement Learning in the Advanced Cybersecurity Threat Detection and Protection. Information Systems Frontiers 2023, 25, 589–611.
  13. Chen, C.; Liu, J.; Tan, H.; Li, X.; Wang, K.I.K.; Li, P.; Sakurai, K.; Dou, D. Trustworthy Federated Learning: Privacy, Security, and Beyond. Knowledge and Information Systems 2025, 67, 2321–2356.
  14. Aljunaid, S.K.; Almheiri, S.J.; Dawood, H.; Khan, M.A. Secure and Transparent Banking: Explainable AI-Driven Federated Learning Model for Financial Fraud Detection. Journal of Risk and Financial Management 2025, 18, 179.
  15. Hallaji, E.; Razavi-Far, R.; Saif, M.; Wang, B.; Yang, Q. Decentralized Federated Learning: A Survey on Security and Privacy. IEEE Transactions on Big Data 2024, 10, 194–213.
  16. Pingulkar, S.; Pawade, D. Federated Learning Architectures for Credit Risk Assessment: A Comparative Analysis of Vertical, Horizontal, and Transfer Learning Approaches. In Proceedings of the 2024 IEEE International Conference on Blockchain and Distributed Systems Security (ICBDS). IEEE, 2024, pp. 1–7.
  17. Damoun, F.; Seba, H.; State, R. Privacy-Preserving Behavioral Anomaly Detection in Dynamic Graphs for Card Transactions. In Proceedings of the International Conference on Web Information Systems Engineering (WISE). Springer, 2024, pp. 286–301.
  18. Wu, B.; Wu, W. Model-Free Cooperative Optimal Output Regulation for Linear Discrete-Time Multi-Agent Systems Using Reinforcement Learning. Mathematical Problems in Engineering 2023, 2023, 6350647.
  19. Ding, Z.; Huang, J.; Duan, Q.; Zhang, C.; Zhao, Y.; Gu, S. A Dual-Level Game-Theoretic Approach for Collaborative Learning in UAV-Assisted Heterogeneous Vehicle Networks. In Proceedings of the 2025 IEEE International Performance, Computing, and Communications Conference (IPCCC). IEEE, 2025, pp. 1–8.
  20. Xiong, H.; Xia, Y.; Zhao, Y.; Wahaballa, A.; Yeh, K.H. Heterogeneous Privacy-Preserving Blockchain-Enabled Federated Learning for Social Fintech. IEEE Transactions on Computational Social Systems 2025, pp. 1–16.
  21. Wu, B.; Ding, Z.; Ostigaard, L.; Huang, J. Reinforcement Learning-Based Energy-Aware Coverage Path Planning for Precision Agriculture. In Proceedings of the 2025 ACM Research on Adaptive and Convergent Systems (RACS). ACM, 2025, pp. 1–8.
  22. Orabi, M.M.; Emam, O.; Fahmy, H. Adapting Security and Decentralized Knowledge Enhancement in Federated Learning Using Blockchain Technology: Literature Review. Journal of Big Data 2025, 12, 55.
  23. Uddin, M.P.; Xiang, Y.; Hasan, M.; Bai, J.; Zhao, Y.; Gao, L. A Systematic Literature Review of Robust Federated Learning: Issues, Solutions, and Future Research Directions. ACM Computing Surveys 2025, 57, 1–62.
  24. Duan, G.; Lv, H.; Wang, H.; Feng, G.; Li, X. Practical Cyber Attack Detection With Continuous Temporal Graph in Dynamic Network System. IEEE Transactions on Information Forensics and Security 2024, 19, 4851–4864.
  25. Zamanzadeh Darban, Z.; Webb, G.I.; Pan, S.; Aggarwal, C.; Salehi, M. Deep Learning for Time Series Anomaly Detection: A Survey. ACM Computing Surveys 2024, 57, 1–42.
  26. Bello, Y.; Hussein, A.R. Dynamic Policy Decision/Enforcement Security Zoning Through Stochastic Games and Meta Learning. IEEE Transactions on Network and Service Management 2025, 22, 807–821.
  27. Huang, W.; Shi, Z.; Ye, M.; Li, H.; Du, B. Self-Driven Entropy Aggregation for Byzantine-Robust Heterogeneous Federated Learning. In Proceedings of the Forty-First International Conference on Machine Learning (ICML). PMLR, 2024.
  28. Li, Y.; Wang, Y.; Chen, Z.; Yuan, H. A Multi-Layer Aggregation Backdoor Defense Framework for Federated Learning. In Proceedings of the 2025 International Conference on Communication, Remote Sensing and Information Technology (CRSIT). IEEE, 2025, pp. 126–132.
  29. Abazari, A.; Ghafouri, M.; Jafarigiv, D.; Atallah, R.; Assi, C. Developing a Security Metric for Assessing the Power Grid’s Posture Against Attacks From EV Charging Ecosystem. IEEE Transactions on Smart Grid 2025, 16, 254–276.
  30. Presekal, A.; Ştefanov, A.; Semertzis, I.; Palensky, P. Spatio-Temporal Advanced Persistent Threat Detection and Correlation for Cyber-Physical Power Systems Using Enhanced GC-LSTM. IEEE Transactions on Smart Grid 2025, 16, 1654–1666.
  31. Wu, Y.; Hu, Y.; Wang, J.; Feng, M.; Dong, A.; Yang, Y. An Active Learning Framework Using Deep Q-Network for Zero-Day Attack Detection. Computers & Security 2024, 139, 103713.
  32. Pan, D.; Wu, B.N.; Sun, Y.L.; Xu, Y.P. A Fault-Tolerant and Energy-Efficient Design of a Network Switch Based on a Quantum-Based Nano-Communication Technique. Sustainable Computing: Informatics and Systems 2023, 37, 100827.
  33. Hammad, A.A.; Ahmed, S.R.; Abdul-Hussein, M.K.; Ahmed, M.R.; Majeed, D.A.; Algburi, S. Deep Reinforcement Learning for Adaptive Cyber Defense in Network Security. In Proceedings of the Cognitive Models and Artificial Intelligence Conference (CMAI). ACM, 2024, pp. 292–297.
  34. Wu, B.; Huang, J.; Duan, Q. FedTD3: An Accelerated Learning Approach for UAV Trajectory Planning. In Proceedings of the International Conference on Wireless Artificial Intelligent Computing Systems and Applications (WASA). Springer, 2025, pp. 13–24.
  35. Wu, B.; Huang, J.; Duan, Q. Real-Time Intelligent Healthcare Enabled by Federated Digital Twins With AoI Optimization. IEEE Network 2025, pp. 1–1.
  36. Paparrizos, J.; Boniol, P.; Liu, Q.; Palpanas, T. Advances in Time-Series Anomaly Detection: Algorithms, Benchmarks, and Evaluation Measures. In Proceedings of the 31st ACM SIGKDD Conference on Knowledge Discovery and Data Mining (KDD). ACM, 2025, pp. 6151–6161.
  37. Feng, X.; Han, J.; Zhang, R.; Xu, S.; Xia, H. Security Defense Strategy Algorithm for Internet of Things Based on Deep Reinforcement Learning. High-Confidence Computing 2024, 4, 100167.
  38. Fang, Z.; Wang, J.; Ren, Y.; Han, Z.; Poor, H.V.; Hanzo, L. Age of information in energy harvesting aided massive multiple access networks. IEEE Journal on Selected Areas in Communications 2022, 40, 1441–1456.
  39. Farhaoui, Y.; Allaoui, A.E.; Amounas, F.; Mohammed, F.; Ziani, S.; Taherdoost, H.; Triantafyllou, S.A.; Bhushan, B. A Multi-Layered Protection System for Enhancing Data Security in Cloud Computing Environments. In Proceedings of the International Conference on Artificial Intelligence and Smart Environment (AISE). Springer, 2024, pp. 559–568.
  40. Feng, J.; Ren, Z.; Li, C.; Li, W. A Benders-Combined Safe Reinforcement Learning Framework for Risk-Averse Dispatch Considering Frequency Security Constraints. IEEE Transactions on Circuits and Systems II: Express Briefs 2025, 72, 1063–1067.
  41. Fang, Z.; Liu, Z.; Wang, J.; Hu, S.; Guo, Y.; Deng, Y.; Fang, Y. Task-Oriented Communications for Visual Navigation With Edge-Aerial Collaboration in Low Altitude Economy. arXiv preprint arXiv:2504.18317 2025.
  42. Huang, J.; Wu, B.; Duan, Q.; Dong, L.; Yu, S. A Fast UAV Trajectory Planning Framework in RIS-Assisted Communication Systems With Accelerated Learning via Multithreading and Federating. IEEE Transactions on Mobile Computing 2025, pp. 1–16.
  43. Pan, D.; Wu, B.N.; Sun, Y.L.; Xu, Y.P. A fault-tolerant and energy-efficient design of a network switch based on a quantum-based nano-communication technique. Sustainable Computing: Informatics and Systems 2023, 37, 100827.
Figure 1. DEFEND Framework.
Figure 2. Training curves of different reinforcement learning algorithms for MDP-based defense policy optimization across various network architectures and data heterogeneity levels. The curves show cumulative reward over training episodes, where α represents the Dirichlet concentration parameter controlling data distribution heterogeneity (lower values indicate higher heterogeneity).
Figure 3. Clean accuracy under defense performance comparison across different reinforcement learning algorithms, network architectures, and data heterogeneity levels. The curves demonstrate each algorithm’s ability to preserve model utility for legitimate classification tasks while actively defending against temporal backdoor attacks during training.
Table 2. Experimental Parameters and Configuration Settings.
| Parameter | Symbol | Value |
|---|---|---|
| Dataset and Client Configuration | | |
| Datasets | | CIFAR-10, FEMNIST, MNIST |
| Number of clients | $N$ | 10, 20, 30, 40, 50 |
| Data distribution | | Non-IID |
| Dirichlet concentration | $\alpha$ | 0.1, 0.5, 1.0 |
| Malicious client ratio | $\kappa$ | 0.1, 0.2, 0.3, 0.4 |
| Model architecture | | MobileNet V2, ResNet-18 |
| Federated Learning Parameters | | |
| Learning rate | $\eta$ | 0.01 |
| Local epochs | | 5 |
| Communication rounds | $T$ | 100 |
| Batch size | | 32 |
| Poisoning ratio | $\rho$ | 0.1 |
| Defense Framework Parameters | | |
| Short window size | $w_f$ | 5 |
| Long window size | $w_s$ | 10 |
| Detection threshold | $\tau_\alpha$ | 0.8 |
| Temporal regularization | $\lambda_\tau$ | 0.1 |
| Pattern templates | $|P_\Theta|$ | 10 |
| Geometric median tolerance | $\epsilon$ | $10^{-6}$ |
| Reputation decay scales | $K_r$ | 3 |
| MDP and Reinforcement Learning Parameters | | |
| Discount factor | $\gamma$ | 0.95 |
| Policy learning rate | | 0.001 |
| Training episodes | | 500 |
| Max steps per episode | | 1000 |
| Exploration rate | $\epsilon$ | 0.1 |
| Experience buffer size | | 10,000 |
| Target network update | | 1000 steps |
| Minimum participation | $N_{\min}$ | $0.6N$ |
Table 3. Defense Success Rate for ResNet-18 under varying client numbers and data heterogeneity (κ = 0.2).
| N | α = 0.1 | α = 0.5 | α = 1.0 |
|---|---|---|---|
| 10 | 0.847 ± 0.023 | 0.892 ± 0.019 | 0.924 ± 0.015 |
| 20 | 0.863 ± 0.021 | 0.906 ± 0.017 | 0.938 ± 0.013 |
| 30 | 0.871 ± 0.019 | 0.915 ± 0.016 | 0.947 ± 0.012 |
| 40 | 0.876 ± 0.018 | 0.921 ± 0.015 | 0.952 ± 0.011 |
| 50 | 0.879 ± 0.017 | 0.925 ± 0.014 | 0.956 ± 0.010 |
Table 4. Temporal Detection Efficiency for ResNet-18 under varying client numbers and data heterogeneity (κ = 0.2).
| N | α = 0.1 | α = 0.5 | α = 1.0 |
|---|---|---|---|
| 10 | 0.673 ± 0.041 | 0.721 ± 0.038 | 0.768 ± 0.033 |
| 20 | 0.695 ± 0.039 | 0.742 ± 0.035 | 0.789 ± 0.031 |
| 30 | 0.708 ± 0.037 | 0.755 ± 0.033 | 0.802 ± 0.029 |
| 40 | 0.716 ± 0.036 | 0.763 ± 0.032 | 0.811 ± 0.028 |
| 50 | 0.721 ± 0.035 | 0.769 ± 0.031 | 0.817 ± 0.027 |
Table 5. Defense Success Rate for ResNet-18 under varying client numbers and malicious ratios (α = 0.5).
| N | κ = 0.1 | κ = 0.2 | κ = 0.3 | κ = 0.4 |
|---|---|---|---|---|
| 10 | 0.943 ± 0.012 | 0.892 ± 0.019 | 0.834 ± 0.026 | 0.768 ± 0.032 |
| 20 | 0.957 ± 0.011 | 0.906 ± 0.017 | 0.847 ± 0.024 | 0.781 ± 0.030 |
| 30 | 0.964 ± 0.010 | 0.915 ± 0.016 | 0.856 ± 0.023 | 0.791 ± 0.029 |
| 40 | 0.969 ± 0.009 | 0.921 ± 0.015 | 0.863 ± 0.022 | 0.798 ± 0.028 |
| 50 | 0.972 ± 0.009 | 0.925 ± 0.014 | 0.867 ± 0.021 | 0.803 ± 0.027 |
Table 6. Temporal Detection Efficiency for ResNet-18 under varying client numbers and malicious ratios (α = 0.5).
| N | κ = 0.1 | κ = 0.2 | κ = 0.3 | κ = 0.4 |
|---|---|---|---|---|
| 10 | 0.825 ± 0.028 | 0.721 ± 0.038 | 0.612 ± 0.045 | 0.498 ± 0.052 |
| 20 | 0.841 ± 0.026 | 0.742 ± 0.035 | 0.634 ± 0.042 | 0.521 ± 0.049 |
| 30 | 0.852 ± 0.025 | 0.755 ± 0.033 | 0.649 ± 0.040 | 0.537 ± 0.047 |
| 40 | 0.859 ± 0.024 | 0.763 ± 0.032 | 0.658 ± 0.039 | 0.547 ± 0.046 |
| 50 | 0.864 ± 0.023 | 0.769 ± 0.031 | 0.664 ± 0.038 | 0.554 ± 0.045 |
Table 7. Defense Success Rate for MobileNet V2 under varying client numbers and data heterogeneity (κ = 0.2).
| N | α = 0.1 | α = 0.5 | α = 1.0 |
|---|---|---|---|
| 10 | 0.821 ± 0.025 | 0.869 ± 0.021 | 0.908 ± 0.017 |
| 20 | 0.836 ± 0.023 | 0.883 ± 0.019 | 0.921 ± 0.015 |
| 30 | 0.845 ± 0.022 | 0.892 ± 0.018 | 0.930 ± 0.014 |
| 40 | 0.851 ± 0.021 | 0.898 ± 0.017 | 0.936 ± 0.013 |
| 50 | 0.855 ± 0.020 | 0.902 ± 0.016 | 0.940 ± 0.012 |
Table 8. Temporal Detection Efficiency for MobileNet V2 under varying client numbers and data heterogeneity (κ = 0.2).
| N | α = 0.1 | α = 0.5 | α = 1.0 |
|---|---|---|---|
| 10 | 0.657 ± 0.043 | 0.703 ± 0.040 | 0.751 ± 0.036 |
| 20 | 0.678 ± 0.041 | 0.724 ± 0.038 | 0.772 ± 0.034 |
| 30 | 0.691 ± 0.039 | 0.737 ± 0.036 | 0.785 ± 0.032 |
| 40 | 0.699 ± 0.038 | 0.745 ± 0.035 | 0.793 ± 0.031 |
| 50 | 0.704 ± 0.037 | 0.751 ± 0.034 | 0.799 ± 0.030 |
Table 9. Defense Success Rate for MobileNet V2 under varying client numbers and malicious ratios (α = 0.5).
| N | κ = 0.1 | κ = 0.2 | κ = 0.3 | κ = 0.4 |
|---|---|---|---|---|
| 10 | 0.926 ± 0.014 | 0.869 ± 0.021 | 0.807 ± 0.028 | 0.739 ± 0.035 |
| 20 | 0.940 ± 0.013 | 0.883 ± 0.019 | 0.821 ± 0.026 | 0.753 ± 0.033 |
| 30 | 0.947 ± 0.012 | 0.892 ± 0.018 | 0.831 ± 0.025 | 0.763 ± 0.032 |
| 40 | 0.952 ± 0.011 | 0.898 ± 0.017 | 0.838 ± 0.024 | 0.770 ± 0.031 |
| 50 | 0.955 ± 0.010 | 0.902 ± 0.016 | 0.843 ± 0.023 | 0.775 ± 0.030 |
Table 10. Temporal Detection Efficiency for MobileNet V2 under varying client numbers and malicious ratios (α = 0.5).
| N | κ = 0.1 | κ = 0.2 | κ = 0.3 | κ = 0.4 |
|---|---|---|---|---|
| 10 | 0.809 ± 0.030 | 0.703 ± 0.040 | 0.591 ± 0.047 | 0.474 ± 0.054 |
| 20 | 0.825 ± 0.028 | 0.724 ± 0.038 | 0.613 ± 0.045 | 0.497 ± 0.052 |
| 30 | 0.836 ± 0.027 | 0.737 ± 0.036 | 0.628 ± 0.043 | 0.514 ± 0.050 |
| 40 | 0.843 ± 0.026 | 0.745 ± 0.035 | 0.637 ± 0.042 | 0.525 ± 0.049 |
| 50 | 0.848 ± 0.025 | 0.751 ± 0.034 | 0.644 ± 0.041 | 0.533 ± 0.048 |
Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.