A Comprehensive Survey of Federated Model Evolution in Open Environments

Ji Wang; Zhengyi Zhong; Jingxuan Zhou; Jiahui Ling; Jingyi Leng; Wenzheng Jiang; Weidong Bao

doi:10.20944/preprints202606.0652.v1

Submitted: 04 June 2026

Posted: 09 June 2026


Abstract

With the rapid development of the Internet of Things and edge computing, federated learning (FL) has become an important technique for collaborative intelligent modeling across distributed nodes. However, in real-world open environments, federated systems must continuously cope with dynamically joining devices, evolving service requirements, and privacy-driven data deletion, making traditional federated training paradigms built on static assumptions increasingly inadequate. To address this challenge, this paper provides a systematic survey of key research progress on federated learning from the perspective of continuous model evolution in open environments. Specifically, we analyze three core problems faced by federated systems in open environments: how to efficiently adapt to newly arrived nodes or data distributions, how to preserve performance on historical tasks during continual training, and how to effectively remove the contribution of specific data or nodes under privacy and compliance requirements. We then organize the literature around three major technical routes, namely federated domain adaptation, federated continual learning, and federated unlearning. On this basis, we further summarize commonly used evaluation metrics and mainstream experimental benchmarks in this area. Finally, we discuss the major challenges and future research directions of federated model evolution in open environments, with the goal of providing a systematic reference for building adaptive and sustainably evolving distributed intelligent systems.

Keywords: model evolution; federated learning; open environments; continual learning; unlearning; domain adaptation

Subject: Computer Science and Mathematics - Artificial Intelligence and Machine Learning

1. Introduction

With the deep integration of the Internet of Things and artificial intelligence technologies, end devices such as smartphones have experienced explosive growth in both quantity and sensing capability. Against this backdrop, end devices are increasingly expected to undertake complex computing tasks. However, if the massive data streams generated by these devices are processed entirely through traditional centralized cloud computing, severe bottlenecks arise in network bandwidth, transmission latency, and energy efficiency. Among various implementations of distributed intelligent computing architectures, federated learning (FL) [1], as an emerging distributed machine learning paradigm, has attracted extensive attention [2,3]. Unlike conventional centralized training, which requires aggregating data at a central server, federated learning allows end devices to train models locally using their private data and upload only model parameters or gradient updates to the server for aggregation. This “data-local, model-global” mechanism not only effectively avoids the privacy leakage risks caused by raw data transmission, but also significantly reduces dependence on communication bandwidth, making FL an important solution to the data-island problem in large-scale distributed networks.

Traditional FL is based on closed and static settings. In real open environments, however, distributed systems evolve in a continuous streaming manner, and both nodes and data keep flowing into and out of the system. As shown in Figure 1, new nodes carrying new knowledge continuously join the federated training (❶), and existing nodes may also generate new tasks or new data streams (❷), forming inflow at both the node and data levels. Meanwhile, existing devices may request the removal of their influence on the global model due to privacy concerns (❸), while outdated data or privacy-sensitive information subject to the “right to be forgotten” should also be removed from trained models (❹), forming outflow at both levels. These dynamic streaming characteristics imply that FL is no longer a closed optimization process, but rather an open evolutionary process. The system must not only absorb newly joined nodes to extend knowledge boundaries, but also handle the retirement of obsolete tasks or the removal of sensitive information, thereby maintaining model timeliness and robustness amid continuous inflow and outflow.

Challenges. Facing the above bidirectional dynamic data flow in open environments, traditional static training paradigms no longer work. To maintain the real-time utility of the global model, federated learning systems must be reshaped into evolving entities with lifelong learning capability. However, realizing such continuous evolution is far from trivial, as it requires simultaneously addressing the following three mutually constrained challenges:

How can old knowledge efficiently adapt to new knowledge (related to ❶ and ❷)? As new devices and new data continuously arrive, data distributions often exhibit significant non-IID characteristics. The model must possess strong plasticity so that it can rapidly transfer previously learned knowledge to newly arrived knowledge and achieve effective adaptation.
How can the inflow of new knowledge avoid forgetting old knowledge (related to ❶ and ❷)? As the model continuously adapts to new tasks or distributions from new nodes, parameter updates can easily overwrite previously learned feature representations, causing a sharp performance drop on original node tasks or historical data distributions, namely catastrophic forgetting. How to consolidate existing memory while learning new knowledge is the cornerstone of stable continual evolution.
How can specific old knowledge be precisely removed (related to ❸ and ❹)? When users request data withdrawal or when maliciously contaminated data must be removed, the system must support unlearning. This requires the model to erase the influence of specific nodes or data from global parameters at an acceptable cost, without expensive retraining from scratch, thereby maintaining system availability while satisfying privacy and compliance requirements.

Solutions. To this end, existing studies mainly focus on three complementary technical routes. The first is Federated Domain Adaptation (FDA). This line of work focuses on the problem of how old knowledge can efficiently adapt to new knowledge. By effectively transferring rich knowledge accumulated in source domains to target domains, these methods help the model overcome the obstacles caused by non-IID data distributions, so that it can quickly adapt to new data distributions when facing newly joined nodes or new service scenarios, thereby enabling efficient integration of new knowledge and cold-start optimization. The second is Federated Continual Learning (FCL). This line of work aims to solve the problem of how to prevent forgetting old knowledge when new knowledge flows in. Through parameter regularization, knowledge distillation, replay mechanisms, and related strategies, FCL seeks to balance model plasticity and stability, ensuring that the model does not damage historical task memory representations while continuously absorbing knowledge from new tasks, thereby effectively mitigating catastrophic forgetting and preserving old knowledge over the long term. The third is Federated Unlearning (FU). This line of work focuses on how to precisely remove specific knowledge. In response to privacy withdrawal requests or malicious data cleansing needs, FU designs precise gradient correction or parameter slicing strategies so that the model can remove the contribution of specific data or nodes from global parameters at low cost, achieving selective deletion of old knowledge without retraining the model from scratch.

These three problems are not independent but form a tightly coupled triangle of constraints (shown in Figure 2). First, adaptation triggers forgetting: when the model adapts to newly joined nodes via FDA, the resulting parameter shifts risk overwriting representations learned from earlier tasks, directly creating the forgetting problem that FCL must solve. Second, continual learning complicates unlearning: as FCL accumulates knowledge from sequential tasks into shared parameters, the contribution of any single client becomes deeply entangled with subsequent updates, making precise removal far harder than in a static model. Third, unlearning disrupts adaptation: after FU removes a client’s contribution, the global model may lose transferable features that benefited other domains, requiring re-adaptation. Despite this tight coupling, research that jointly addresses all three problems remains extremely scarce. Only a handful of studies have explored even pairwise intersections. For instance, Powder [4] investigates prompt-based dual knowledge transfer that simultaneously tackles domain shift and catastrophic forgetting, and FL-CLIP [5] bridges plasticity and stability by leveraging pretrained vision–language models for federated class-incremental adaptation. Both works, however, only address the intersection of FDA and FCL; no existing study simultaneously considers all three dimensions. Given this gap, this survey adopts a unified evolutionary framework to reveal the interdependencies among the three problems, while organizing the detailed technical review along each direction separately to provide a clear and systematic picture of the current research landscape.

Related Work and Contributions. To the best of our knowledge, a considerable number of surveys and review papers already exist on federated continual learning, federated unlearning, and federated domain adaptation. For example, [6] analyzes federated transfer learning from the knowledge inflow perspective, especially for system heterogeneity, data increment, and label scarcity; [7] discusses federated continual learning from the perspective of knowledge retention under dynamic evolution; and [8] reviews privacy erasure mechanisms in federated unlearning. However, these existing works are largely limited to a single perspective, i.e., they focus only on inflow or only on outflow, as summarized in Table 1, and thus lack a holistic view of the full lifecycle of model evolution in open environments. This paper aims to fill this gap. To our knowledge, it is the first survey that comprehensively reviews and evaluates federated model evolution in open environments from the bidirectional perspective of both inflow and outflow at the node and data levels. Main contributions are as follows:

Clarification of challenges and problems. We systematically analyze the key challenges of federated model evolution in open environments and explicitly formulate the complex evolution process into three core scientific problems: adapting old knowledge to new knowledge (how to learn), retaining old knowledge (how to remember), and deleting old knowledge (how to forget).
A systematic survey from a bidirectional perspective. We comprehensively review distributed model evolution techniques in open environments from two complementary dimensions, namely inflow and outflow. In particular, we summarize recent advances in federated domain adaptation (inflow perspective), federated continual learning (inflow perspective), and federated unlearning (outflow perspective), and reveal the intrinsic connections and differences among these technical routes.
Summary of evaluation systems and benchmarks. We systematically summarize the evaluation metrics used for federated collaborative evolution in open environments and carefully review mainstream benchmark datasets and experimental settings across different domains, providing an important reference for building standardized and scalable benchmark platforms for model evolution.

Organization. Section 2 decomposes the problem of distributed collaborative model evolution in open environments and introduces the basic concepts of FL, FDA, FCL, and FU. Section 3 reviews recent advances in federated domain adaptation. Section 4 discusses how to resist the forgetting of old knowledge. Section 5 focuses on the deletion of specific memories and reviews three representative paradigms in federated unlearning. Section 6 summarizes evaluation metrics and common experimental settings for federated model evolution in open environments. Section 7 discusses future research directions, and the final section concludes the paper.

2. Preliminaries

This section first defines the key notations of distributed collaborative intelligent model evolution systems in open environments and analyzes the underlying problem. It then introduces the main techniques involved.

Distributed system architecture. Consider a federated learning system consisting of $N$ distributed nodes. Each node $n \in [N]$ maintains a local intelligent model $W_{n}$ and a local dataset $D_{n}$. The global intelligent model is defined as $W^{g} = \{W_{1}, W_{2}, \dots, W_{N}\}$, representing the set of all node models. Under the federated learning framework, the global model parameters are obtained through an aggregation operation:

$$W^{g} = \mathrm{Aggregate}\left(\{W_{n}\}_{n = 1}^{N}\right). \tag{1}$$

Open-environment scenario decomposition. As illustrated in Figure 1, the model evolution problem in open federated environments can be decomposed into four subproblems:

Subproblem 1—Node inflow: When new nodes join, how can source-domain knowledge be transferred to improve target-domain performance without degrading existing clients? The core challenge is distribution heterogeneity (domain inconsistency and label-space discrepancy), requiring efficient alignment mechanisms for cold-start initialization.
Subproblem 2—Data inflow: Existing nodes continuously generate new data. The model must absorb incremental knowledge while preventing catastrophic forgetting of historical tasks, balancing plasticity and stability.
Subproblem 3—Node outflow: When a node leaves, requests privacy removal, or is identified as malicious, how can its influence be removed from the global model at low cost without damaging cross-node shared knowledge?
Subproblem 4—Data outflow: When a node requests forgetting of specific local samples due to user withdrawal, privacy erasure, or data expiration, how can the corresponding knowledge be efficiently removed from the model?

Subproblems 1 and 2 correspond to federated domain adaptation and federated continual learning, while Subproblems 3 and 4 correspond to federated unlearning. We next introduce each in turn.

2.1. Federated Learning (FL)

Learning Process

Federated learning (FL) is an emerging distributed machine learning paradigm whose central idea is to collaboratively optimize a global model while preserving user data privacy. An FL system consists of $N$ clients $n \in [N]$ and a central server. The FL workflow follows the principle of “data stay local, models move.” In each communication round $t$, the server broadcasts the current global model $W^{g}(t)$. Selected clients independently perform $E$ rounds of local optimization, typically stochastic gradient descent (SGD), on their local dataset $D_{n}$, producing an updated local model $W^{n}(t + 1)$. The clients then upload only model parameters or gradient updates to the central server. Through an aggregation operation such as FedAvg, the server generates a new global model $W^{g}(t + 1)$, which is intended to synthesize the knowledge of all participants.

Training Objective

The core optimization objective of FL is to minimize the weighted sum of local losses across all clients, i.e., to solve a global loss minimization problem:

$$\min_{W^{g}} L_{\mathrm{FL}}(W^{g}) = \sum_{n = 1}^{N} \frac{|D_{n}|}{\sum_{k = 1}^{N} |D_{k}|} L_{n}(W^{g}; D_{n}), \tag{2}$$

where $L_{n}(W^{g}; D_{n})$ denotes the empirical loss of the global model $W^{g}$ on node $n$’s local dataset $D_{n}$. More specifically, let $l(\cdot, \cdot)$ denote the loss function for a single data point; then the local empirical loss is defined as

$$L_{n}(W^{g}; D_{n}) = \frac{1}{|D_{n}|} \sum_{(x, y) \in D_{n}} l\left(f(x; W^{g}), y\right). \tag{3}$$

The aggregation operation FedAvg updates the global model by weighted averaging:

$$W^{g}(t + 1) = \sum_{n = 1}^{N} \frac{|D_{n}|}{\sum_{k = 1}^{N} |D_{k}|} W^{n}(t + 1). \tag{4}$$
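The round structure and the weighted average in Eq. (4) can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not a reference implementation: every client participates in the round, the local model is a linear least-squares predictor, local optimization uses full-batch gradient steps instead of SGD, and the names `fedavg`, `local_update`, and `fl_round` are hypothetical.

```python
import numpy as np


def fedavg(client_weights, client_sizes):
    """Weighted average of client parameters as in Eq. (4); weights are |D_n| / sum_k |D_k|."""
    sizes = np.asarray(client_sizes, dtype=float)
    coeffs = sizes / sizes.sum()
    stacked = np.stack([np.asarray(w, dtype=float) for w in client_weights])
    # Contract the client axis: sum_n coeffs[n] * stacked[n]
    return np.tensordot(coeffs, stacked, axes=1)


def local_update(w_global, X, y, epochs=5, lr=0.1):
    """E full-batch gradient steps on the local mean squared error (assumed toy model)."""
    w = np.array(w_global, dtype=float)
    for _ in range(epochs):
        grad = 2.0 * X.T @ (X @ w - y) / len(y)
        w -= lr * grad
    return w


def fl_round(w_global, clients, epochs=5, lr=0.1):
    """One communication round: broadcast, local training, size-weighted aggregation."""
    local_models = [local_update(w_global, X, y, epochs, lr) for X, y in clients]
    sizes = [len(y) for _, y in clients]
    return fedavg(local_models, sizes)


# Two toy clients whose local optima differ (slopes 2 and 3), i.e. a non-IID setting.
clients = [
    (np.array([[1.0], [2.0]]), np.array([2.0, 4.0])),
    (np.array([[1.0], [2.0], [3.0]]), np.array([3.0, 6.0, 9.0])),
]
w_next = fl_round(np.zeros(1), clients)
```

Because the two clients disagree on the best slope, the aggregated parameter lands between the two local optima and closer to the larger client's, which is the size-weighted behavior that Eq. (4) prescribes.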

Although FL provides the basic framework for distributed optimization, it is built on idealized assumptions of a static and closed environment. In realistic open and dynamic deployment environments, the FL objective $L_{\mathrm{FL}}$ only focuses on aggregating current overall performance and lacks inherent mechanisms for handling continuous data streams. This lack of robustness to dynamic changes is precisely the theoretical motivation behind the subsequent evolution paradigms of FCL, FU, and FDA.

2.2. Federated Domain Adaptation (FDA)

Learning Process

Federated domain adaptation (FDA) focuses on efficiently adapting old knowledge to new knowledge in open environments. Its goal is to effectively transfer rich knowledge accumulated in the source domain $D^{S}$ to the target domain $D^{T}$, where the target domain typically represents newly joined nodes or existing nodes facing new tasks. By overcoming the barriers caused by data distribution heterogeneity (non-IID), FDA helps models quickly adapt to new data distributions, thereby enabling efficient integration of new knowledge and cold-start optimization.

Training Objective

The optimization objective of FDA is to leverage knowledge accumulated in the source domain to minimize the empirical loss on the target domain:

$$\min_{W^{g}} L_{\mathrm{FDA}}(W^{g}) = L_{\mathrm{target}}(W^{g}; D^{T}), \tag{5}$$

where $L_{\mathrm{target}}(W^{g}; D^{T})$ denotes the empirical loss of the model on target-domain data. In data-inflow scenarios, FDA aims to improve the model’s performance on new service scenarios for existing nodes; in node-inflow scenarios, it focuses on improving the performance and cold-start speed of newly joined nodes.

2.3. Federated Continual Learning (FCL)

Learning Process

Federated continual learning (FCL) is designed to address catastrophic forgetting in federated environments. In open environments, new learning tasks and data streams $D^{t}$ continuously emerge. FCL requires the model $W^{g}$ to effectively retain memory of historical tasks $D^{\mathrm{hist}} = \bigcup_{\tau = 0}^{t - 1} D^{\tau}$ while continuously absorbing current knowledge. The learning process of FCL seeks an optimization strategy that balances model stability and plasticity, with a particular focus on preventing the forgetting of old knowledge.

Training Objective

At time $t$, the optimization objective of FCL is to minimize the loss on both the current task and all previous tasks with as little resource consumption as possible, thereby preventing catastrophic forgetting. This extends the FL training objective into a multi-objective optimization problem with historical-memory constraints:

$$\begin{cases} \min_{W^{g}} \sum_{s = 0}^{t - 1} L_{\mathrm{hist}}(W^{g}; D^{s}) \\ \min_{W^{g}} L_{\mathrm{new}}(W^{g}; D^{t}) \\ \min \mathrm{ResourceOverhead} \end{cases}, \tag{6}$$

where $L_{\mathrm{new}}$ denotes the loss of $W^{g}$ on current data, and $L_{\mathrm{hist}}$ denotes the loss on historical data. This objective can be divided into three sub-objectives: minimizing the global model loss on newly arrived data, minimizing the global model loss on all historical task data, and minimizing the resource overhead of the continual learning process. In data-inflow scenarios, $D^{t}$ denotes newly generated data on existing nodes and $D^{s}$ denotes their historical data; in node-inflow scenarios, $D^{t}$ denotes data from newly joined nodes and $D^{s}$ denotes all data from original nodes. The computation of $L_{\mathrm{new}}$ is given by

$$L_{\mathrm{new}}(W^{g}; D^{t}) = \sum_{n = 1}^{N} \frac{|D_{n}^{t}|}{\sum_{k = 1}^{N} |D_{k}^{t}|} L_{n}(W^{g}; D_{n}^{t}). \tag{7}$$
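One common way to make the multi-objective problem in Eq. (6) concrete is weighted-sum scalarization. The form below is an illustrative assumption rather than a formulation taken from the surveyed works: the trade-off coefficient $\lambda$ and the budget $B$ are hypothetical symbols, and the resource term is treated as a constraint instead of a third loss.

```latex
\min_{W^{g}} \; L_{\mathrm{new}}(W^{g}; D^{t})
  + \lambda \sum_{s = 0}^{t - 1} L_{\mathrm{hist}}(W^{g}; D^{s})
\quad \text{s.t.} \quad \mathrm{ResourceOverhead} \le B,
\qquad \lambda \ge 0 .
```

Under this reading, $\lambda = 0$ recovers pure adaptation to $D^{t}$ (maximal plasticity), while a large $\lambda$ pins the model to its historical behavior (maximal stability). Since the historical data $D^{s}$ are usually no longer available at time $t$, practical methods have to approximate the second term, for example through regularization, distillation, or replay.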

2.4. Federated Unlearning (FU)

Learning Process

Federated unlearning (FU) focuses on removing the influence of part of the training data and aims to solve the problem of how to delete specific memory. When the system receives requests to withdraw node data or remove the impact of maliciously contaminated data, the goal of FU is to rapidly generate a new model $W^{g'}$. Ideally, $W^{g'}$ should approximate the ideal model $W^{*}$ trained from scratch using only the retained set $D_{\mathrm{retain}}$.

Training Objective

The core of FU is a multi-objective optimization problem that seeks to maximize the forgetting effect while minimizing interference with knowledge contained in the retained set $D_{\mathrm{retain}}$. Since the ideal model $W^{*}$ cannot be obtained without retraining, FU uses surrogate loss functions and seeks a model that remains stable on the retained set while behaving close to random on the forget set:

$$\begin{cases} \max_{W^{g}} L_{\mathrm{forget}}(W^{g}; D_{\mathrm{forget}}) \\ \min_{W^{g}} L_{\mathrm{retain}}(W^{g}; D_{\mathrm{retain}}) \\ \min \mathrm{ResourceOverhead} \end{cases}, \tag{8}$$

where $D_{\mathrm{retain}} = D_{\mathrm{total}} \setminus D_{\mathrm{forget}}$ denotes the retained set. $L_{\mathrm{retain}}$ is the retained loss used to ensure that the generalization performance of $W^{g}$ on retained data is preserved. $L_{\mathrm{forget}}$ is the forgetting loss, whose objective is to minimize the model’s performance on forgotten data, for example by maximizing classification error or output entropy on forgotten samples so as to achieve precise erasure of the corresponding knowledge. This optimization can be divided into three sub-objectives: maximizing the loss of the global model on forgotten data, minimizing the loss on retained data, and minimizing the resource overhead of the forgetting process. This training objective applies to both node-outflow and data-outflow scenarios. For the node-outflow case, it can be further refined as follows:

$$\begin{cases} \max_{W^{g}} L_{\mathrm{forget}}(W^{g}; D_{\mathrm{forget}}) \\ \min_{W^{g}} L_{\mathrm{cross}}(W^{g}; D_{\mathrm{cross}}) \\ \min_{W^{g}} L_{\mathrm{other}}(W^{g}; D_{\mathrm{other}}) \\ \min_{W^{g}} \mathrm{ResourceOverhead} \end{cases}, \tag{9}$$

where $L_{\mathrm{cross}}$ denotes the total loss on knowledge shared between the forgotten node and the remaining nodes, and $L_{\mathrm{other}}$ denotes the loss on the remaining knowledge excluding such cross knowledge.
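As a rough illustration of the forget/retain trade-off in Eq. (8), the sketch below combines gradient ascent on the forget loss with gradient descent on the retain loss for a toy linear model. It is a simplified surrogate under stated assumptions, not a specific method from the literature: the scalarization weight `lam`, the step size, the toy data, and the helper names (`mse`, `mse_grad`, `unlearn`) are all hypothetical, and the update runs in one place rather than across clients.

```python
import numpy as np


def mse(w, X, y):
    """Mean squared error of the linear model X @ w."""
    return float(np.mean((X @ w - y) ** 2))


def mse_grad(w, X, y):
    """Gradient of the mean squared error with respect to w."""
    return 2.0 * X.T @ (X @ w - y) / len(y)


def unlearn(w, retain, forget, lam=0.1, lr=0.05, steps=50):
    """Descend on L_retain - lam * L_forget, i.e. retain descent plus forget ascent."""
    X_r, y_r = retain
    X_f, y_f = forget
    w = np.array(w, dtype=float)
    for _ in range(steps):
        grad = mse_grad(w, X_r, y_r) - lam * mse_grad(w, X_f, y_f)
        w -= lr * grad
    return w


# Retained data follow y = 2x; the data to forget follow a conflicting rule y = -x.
X_r, y_r = np.array([[1.0], [2.0], [3.0]]), np.array([2.0, 4.0, 6.0])
X_f, y_f = np.array([[1.0], [2.0]]), np.array([-1.0, -2.0])

# "Original" model: least-squares fit on retained and forgotten data together.
w0 = np.linalg.lstsq(np.vstack([X_r, X_f]), np.concatenate([y_r, y_f]), rcond=None)[0]
w_u = unlearn(w0, (X_r, y_r), (X_f, y_f))

retain_before, retain_after = mse(w0, X_r, y_r), mse(w_u, X_r, y_r)
forget_before, forget_after = mse(w0, X_f, y_f), mse(w_u, X_f, y_f)
```

In this toy setting the retain loss drops and the forget loss rises after the update. Note that the ascent term is unbounded: with a large `lam` the objective stops being convex and the parameters can diverge, which is one reason practical methods bound or clip the forgetting step.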

3. Federated Adaptation: Efficient Knowledge Transfer

In open and dynamic federated environments, system evolution relies on continuously admitting new participants or adapting to new service scenarios, corresponding to scenarios ❶ and ❷ in Figure 1. The key question is how to rapidly transfer mature knowledge accumulated in source domains to target domains while strictly preserving privacy and physical isolation. This constitutes the central research problem of federated domain adaptation. Unlike traditional centralized domain adaptation, knowledge adaptation in federated environments cannot rely on simple data mixing or centralized fine-tuning. Owing to the inherently decentralized nature of federated systems, source and target domains are physically separated, which introduces the following two challenges:

Invisible source data: For privacy protection, source-domain data are strictly confined to local devices, causing the target domain to face a severe source-free adaptation challenge. New nodes cannot directly access source data for distribution matching.
Model heterogeneity: Nodes in open environments often employ heterogeneous models, making traditional parameter averaging ineffective. Designing efficient training strategies that enable rapid knowledge adaptation across heterogeneous models is therefore essential for efficient knowledge flow in complex systems.

Existing federated domain adaptation methods can be grouped into four categories, namely data alignment, feature alignment, model decoupling, and strategy optimization (illustrated in Figure 3). Data alignment operates at the input level, feature alignment at the intermediate representation level, model decoupling at the architectural level, and strategy optimization at the algorithmic level.

3.1. Data Alignment

In federated domain adaptation, source data remain local for privacy, preventing direct distribution matching. To enable knowledge transfer without accessing raw data, researchers mainly reconstruct virtual source domains from model artifacts and apply cross-domain augmentation to target data.

3.1.1. Virtual Domain Generation and Statistical Distribution Reconstruction

When source data are invisible, the source model becomes the sole repository of source-domain knowledge, and researchers attempt to recover or synthesize virtual samples by reverse-engineering its parameters and statistics. Generative adversarial networks (GANs) were among the earliest tools used for this purpose. Tang et al. [20] proposed virtual homogeneity learning, using StyleGAN with shared noise priors to generate virtual datasets that bridge client distribution gaps, and subsequent work extended GANs to pseudo-source reconstruction [21,22]. To reduce the computational burden of pixel-level generation, feature-level synthesis has been widely adopted: Li et al. [23] designed embedding generators to synthesize target-style data directly in feature space. More recently, diffusion models have replaced GANs as a mainstream paradigm owing to their superior generation quality and training stability. Yang et al. [24] and Chen et al. [25] used pretrained diffusion models to enable one-shot federated learning, while text-guided style transfer [26] further demonstrates their versatility.

Beyond generative models, a more lightweight route exploits batch normalization (BN) statistics. Liu et al. [27] directly used BN running mean and variance to constrain the generation process, and subsequent studies refined this idea through coarse-to-fine strategies combining BN statistics with Fourier transforms [28] or conditional generators that produce feature prototypes [29]. Yeh et al. [30] further reconstructed virtual source-domain distributions by analyzing classifier weights or building Gaussian mixture models. At the federated level, Guo et al. [31] discussed parameterizing global BN layers as distributions so that participants can recover augmented samples from noise.

3.1.2. Cross-Domain Mixup and Style Transfer

Instead of reconstructing source data, cross-domain augmentation modifies target-domain data by altering their style or mixing their content so that they mimic diverse distributions while preserving semantics. Mixup-based augmentation generates intermediate samples through linear interpolation to smooth decision boundaries, but directly mixing raw data across clients is prohibited in federated learning. Shin et al. [32] addressed this with XOR-based encoding for privacy-preserving cross-client mixing, and Yoon et al. [33] proposed FedMix, which approximates global mixup effects by averaging mini-batches during local iterations. As research advanced, mixup shifted from pixel space to feature space. Yang et al. [34] generated intermediate representations by randomly interpolating instance-level and global feature statistics, and Ding et al. [35] proposed ProxyMix, which builds class-balanced proxy source domains using prototype nearest neighbors and expands them through intra-domain mixup.
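The linear interpolation behind mixup can be sketched as follows. This is the plain, centralized form and is shown only to make the operation concrete; it is not the privacy-preserving XOR encoding of [32] or the mini-batch averaging of FedMix [33]. The function names and the Beta parameter `alpha` are illustrative assumptions.

```python
import numpy as np


def sample_lambda(rng, alpha=0.2):
    """Draw the mixing coefficient from Beta(alpha, alpha), as in standard mixup."""
    return float(rng.beta(alpha, alpha))


def mixup(x_a, y_a, x_b, y_b, lam):
    """Interpolate two samples and their one-hot labels with coefficient lam in [0, 1]."""
    x = lam * x_a + (1.0 - lam) * x_b
    y = lam * y_a + (1.0 - lam) * y_b
    return x, y


x_mix, y_mix = mixup(
    np.array([0.0, 0.0]), np.array([1.0, 0.0]),   # sample A, class 0
    np.array([4.0, 8.0]), np.array([0.0, 1.0]),   # sample B, class 1
    lam=0.25,
)
# x_mix is [3.0, 6.0] and y_mix is [0.25, 0.75]
```

The same function applies unchanged to feature vectors instead of raw inputs, which is the shift from pixel space to feature space described above.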

Style transfer aims to translate target-domain images into source-domain styles, thereby removing distribution shifts caused by non-semantic factors such as illumination or texture. Chen et al. [36] proposed CCST for transferring image styles across clients. To avoid the computational overhead of direct image manipulation, Zhou et al. [37] proposed MixStyle, which synthesizes new style features by combining feature statistics from different domains. Frequency-domain methods offer another perspective: Liu et al. [38] proposed FedDG, which uses Fourier transforms to decompose images into amplitude spectra (style) and phase spectra (content), achieving cross-domain augmentation at very low communication cost by exchanging amplitude spectra.
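The amplitude/phase decomposition used by frequency-domain methods can be illustrated with NumPy's FFT. The sketch below is a simplified assumption-laden version of the idea: it interpolates the whole amplitude spectrum with a hypothetical ratio `alpha`, and the function names are illustrative; it is not the exact procedure of FedDG [38].

```python
import numpy as np


def amplitude_of(img):
    """Amplitude spectrum of an image; this is the only quantity a client would share."""
    return np.abs(np.fft.fft2(img, axes=(0, 1)))


def amplitude_mix(img_local, amp_foreign, alpha=0.5):
    """Keep the local phase (content) and blend in a foreign amplitude (style)."""
    spec = np.fft.fft2(img_local, axes=(0, 1))
    amp, phase = np.abs(spec), np.angle(spec)
    amp_mixed = (1.0 - alpha) * amp + alpha * amp_foreign
    mixed = amp_mixed * np.exp(1j * phase)
    # The blended spectrum stays conjugate-symmetric, so the imaginary part is numerical noise.
    return np.real(np.fft.ifft2(mixed, axes=(0, 1)))


rng = np.random.default_rng(0)
img_a = rng.random((8, 8))   # stands in for a local image
img_b = rng.random((8, 8))   # stands in for an image held by another client
augmented = amplitude_mix(img_a, amplitude_of(img_b), alpha=0.5)
```

With `alpha=0` the function returns the local image unchanged, and with `alpha=1` the output carries the foreign amplitude spectrum together with the local phase. Only `amplitude_of(img_b)` crosses the client boundary, which is why the communication cost of this kind of augmentation is low.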

Remark. Virtual domain generation suits scenarios where source data are completely invisible and only model parameters or BN statistics are available; it trades computation for data but faces a fidelity–privacy trade-off, as higher-quality reconstructions increase model-inversion risks. Cross-domain mixup and style transfer are more appropriate when target-domain data can be augmented at low cost; they offer lightweight alignment but rely on the assumption that style and content are separable, which breaks down in domains such as medical imaging where texture carries diagnostic meaning.

3.2. Feature Alignment

Data-level alignment is constrained by pixel-space dimensionality and semantic gaps, while deeper knowledge integration occurs in feature space. Feature alignment seeks domain-invariant representations where source and target features become indistinguishable, enabling seamless knowledge transfer. This is achieved through two routes: explicit alignment using mathematical distance measures and implicit alignment via adversarial learning.

3.2.1. Explicit Alignment via Statistical Distances

The core logic of explicit alignment is to quantify distributional differences. Maximum mean discrepancy (MMD) [39] maps distributions into a reproducing-kernel Hilbert space and measures the distance between their mean embeddings. After the success of deep adaptation networks [40], this idea was quickly adopted in federated settings: Chen et al. [41] used MMD in D-WFA to measure distribution differences between clients and dynamically adjust aggregation weights. To capture finer-grained discrepancies, Peng et al. [42] proposed moment matching that aligns higher-order statistics beyond means, and Sun et al. [43] applied statistical moment matching in FedKA for domain generalization. However, MMD struggles when distributions have disjoint supports. Wasserstein distance addresses this by measuring the minimum transport cost between distributions, yielding meaningful gradients even without overlap. For example, Nguyen et al. [44] used Wasserstein distance to build distributionally robust federated optimization frameworks.
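For concreteness, the squared MMD between two sets of features can be estimated from kernel evaluations alone. The sketch below uses the biased estimator with an RBF kernel; the kernel choice, the bandwidth parameter `gamma`, and the function names are assumptions for illustration and are not tied to any particular method cited above.

```python
import numpy as np


def rbf_kernel(a, b, gamma=1.0):
    """Pairwise RBF kernel matrix k(a_i, b_j) = exp(-gamma * ||a_i - b_j||^2)."""
    sq_dist = (
        np.sum(a ** 2, axis=1)[:, None]
        + np.sum(b ** 2, axis=1)[None, :]
        - 2.0 * a @ b.T
    )
    return np.exp(-gamma * np.maximum(sq_dist, 0.0))


def mmd2(x, y, gamma=1.0):
    """Biased estimate of squared MMD: mean k(x,x) + mean k(y,y) - 2 * mean k(x,y)."""
    return (
        rbf_kernel(x, x, gamma).mean()
        + rbf_kernel(y, y, gamma).mean()
        - 2.0 * rbf_kernel(x, y, gamma).mean()
    )


rng = np.random.default_rng(0)
source = rng.normal(size=(50, 2))
near_target = source + 0.1   # small distribution shift
far_target = source + 3.0    # large distribution shift
```

Here `mmd2(source, far_target)` is much larger than `mmd2(source, near_target)`, and `mmd2(source, source)` is zero. For the far shift the cross-kernel term is already close to zero, so moving the target even further away barely changes the value, which illustrates the weak signal under near-disjoint supports mentioned above.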

3.2.2. Adversarial Alignment via Discriminators

Inspired by generative adversarial networks, adversarial alignment abandons explicit distance measures and instead introduces a discriminator as an implicit judge. The feature extractor attempts to produce representations that confuse the discriminator, while the discriminator tries to distinguish whether a feature comes from the source or target domain. DANN [45] laid the foundation by synchronously training the feature extractor and domain classifier through a gradient reversal layer. In federated environments, Peng et al. [46] proposed FADA, which uses distributed client features as pseudo samples to fool a global discriminator, achieving cross-client alignment without exposing raw data. FedADG [47] introduces a reference distribution as an intermediate bridge for multi-source unification without pairwise adversarial synchronization. Another line of work focuses on samples near decision boundaries. Saito et al. [48] proposed maximum classifier discrepancy, using prediction disagreement between two classifiers to locate ambiguous target-domain samples. Xia et al. [49] introduced this into source-free adaptation through A2Net, constructing adversarial interaction so that target features fall inside the source classifier’s confidence region.
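The gradient reversal layer mentioned above can be written compactly. The notation below follows the usual presentation of DANN and is not part of this survey's notation: $\theta_{f}$, $\theta_{y}$, and $\theta_{d}$ denote the parameters of the feature extractor, label classifier, and domain classifier, and $\lambda$ is the reversal strength.

```latex
% Gradient reversal layer: identity in the forward pass, sign-flipped gradient in the backward pass
R_{\lambda}(x) = x, \qquad \frac{\partial R_{\lambda}}{\partial x} = -\lambda I

% Saddle-point objective that the layer lets plain backpropagation optimize
E(\theta_{f}, \theta_{y}, \theta_{d}) = L_{y}(\theta_{f}, \theta_{y}) - \lambda L_{d}(\theta_{f}, \theta_{d}),
\qquad
(\hat{\theta}_{f}, \hat{\theta}_{y}) = \arg\min_{\theta_{f}, \theta_{y}} E, \quad
\hat{\theta}_{d} = \arg\max_{\theta_{d}} E
```

Placing $R_{\lambda}$ between the feature extractor and the domain classifier means that a single backward pass updates $\theta_{d}$ to reduce the domain loss $L_{d}$ while pushing $\theta_{f}$ to increase it, which is what makes the synchronous training described above possible.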

Remark. Explicit alignment via statistical distances suits scenarios requiring theoretical guarantees and stable optimization when distributions partially overlap; it provides mathematical rigor but typically aligns only marginal distributions, risking cross-class feature confusion under imbalanced labels. Adversarial alignment is more effective for complex nonlinear distribution shifts; it captures high-order structural differences but suffers from training instability in communication-constrained federated settings and often assumes shared label spaces, which fails under label shift or open-set scenarios. Neither alone handles all shift types.

3.3. Model Decoupling

If feature alignment unifies data representations in a shared space, model decoupling preserves domain-specific properties by separating shared and private knowledge. In federated domain adaptation, models face a trade-off between acquiring general knowledge from source domains and maintaining target-domain specificity. Model decoupling addresses this by physically or logically partitioning knowledge so shared and private components can evolve without interference.

3.3.1. Structural Decoupling and Feature Disentanglement

Structural decoupling opens the black box of feature extraction and explicitly decomposes input signals into domain-invariant content and domain-specific style through multi-branch architectures. Wu et al. [50] proposed COPA, which introduces structured separation into federated settings by splitting the model into an encoder for extracting generic features and a classifier for handling domain-specific tasks, together with a classifier-freezing strategy to prevent target-domain bias from contaminating generic features. Luo et al. [51] further designed a dual-branch architecture, with one branch extracting domain-invariant content features and the other modeling local style features such as lighting or background, enforcing statistical orthogonality between content and style through mutual-information minimization constraints. Wang et al. [52] combined disentanglement with uncertainty estimation to filter unreliable shared features while preserving local specificity. With the rise of generative models and large pretrained models, disentanglement has also moved into generative spaces. Ma et al. [53] proposed FedST, which uses diffusion models to separate structural and stylistic image information in latent space and alleviate cross-client style heterogeneity through generative style transfer. Under parameter-efficient fine-tuning, Bai et al. [54] proposed DiPrompT, a prompt-based lightweight decoupling framework that keeps the backbone frozen while learning global prompts for shared knowledge and domain prompts for domain-specific differences.

3.3.2. Parameter Decoupling and Layer-Wise Partitioning

Compared with implicit feature disentanglement, parameter decoupling provides a more direct engineering solution by physically partitioning network layers into shared bodies and private heads with differentiated aggregation strategies. FedPer [55] and FedRep [56] are foundational works in this area. They argue that shallow feature extractors should learn general visual patterns and therefore participate in global aggregation, whereas top classification heads should remain local for personalized decision boundaries. FedBABU [57] takes this idea further by updating only the shared body while freezing the head during federated training, and even discarding local classifier heads during aggregation, thereby preventing local classification bias from harming transferable feature learning. In contrast, LG-FedAvg [58] argues that in scenarios with highly heterogeneous low-level statistics, local feature extractors should remain personalized while high-level classifiers are shared globally. To reconcile such opposing views, Fed-ROD [59] introduces a dual-head mechanism maintaining both a generic and a personalized classification head, dynamically balancing generalization and personalization. FedCiR [2] instead operates in feature space and aligns client representations through calibrated projection heads.
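The body/head partition described above can be sketched as follows. This is a minimal illustration, assuming model states are dictionaries of NumPy arrays and that the parameter names `body.weight` and `head.weight` are hypothetical; it mirrors the general idea of sharing the body while keeping heads local, not the exact procedure of any cited method.

```python
import numpy as np

def aggregate_shared_body(client_states, client_sizes, shared_keys):
    """Data-size-weighted average over the shared parameters only."""
    total = float(sum(client_sizes))
    global_body = {}
    for key in shared_keys:
        global_body[key] = sum(
            (n / total) * state[key] for state, n in zip(client_states, client_sizes)
        )
    return global_body

def load_shared_body(client_state, global_body):
    """Overwrite the shared body on a client while keeping its private head."""
    updated = dict(client_state)
    updated.update({k: v.copy() for k, v in global_body.items()})
    return updated

# Two hypothetical clients with one shared layer and one private head each.
client_a = {"body.weight": np.ones((2, 2)), "head.weight": np.full((2,), 10.0)}
client_b = {"body.weight": 3.0 * np.ones((2, 2)), "head.weight": np.full((2,), -10.0)}

global_body = aggregate_shared_body([client_a, client_b], [1, 3], ["body.weight"])
client_a_next = load_shared_body(client_a, global_body)
client_b_next = load_shared_body(client_b, global_body)
```

Swapping the roles of the two key sets gives the opposite design, in which feature extractors stay local and classifiers are shared.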

Remark. Structural decoupling suits visual tasks where domain-invariant content and domain-specific style are meaningfully separable; it elegantly addresses what knowledge is transferable, but perfect orthogonal disentanglement is difficult in black-box federated settings and the content–style boundary is ambiguous in non-visual tasks. Parameter decoupling is more practical when label heterogeneity must be handled with minimal communication; it reduces transmission cost by sharing only partial parameters, but its effectiveness depends heavily on the partition choice, and hard splits may block useful gradient flow across layers. Neither alone resolves all heterogeneity challenges.

3.4. Strategy Optimization

Once the model architecture is fixed, the final piece of the puzzle is to design efficient training strategies that accelerate knowledge flow and integration. In federated domain adaptation, simple parameter averaging such as FedAvg often suffers from negative transfer or slow convergence because it ignores distribution differences and model heterogeneity across clients. Researchers have therefore rethought federated optimization from three angles: personalized aggregation, meta-learning, and federated distillation.

3.4.1. Optimization via Personalized Aggregation

Traditional federated aggregation often assumes that all participants contribute equally or proportionally to their data sizes. In heterogeneous open environments, however, knowledge from different source domains has very different value for the target domain. Personalized aggregation thus seeks mechanisms that quantify the affinity between source and target domains and dynamically adjust aggregation weights accordingly. The most direct approach measures affinity through statistical distances: Chen et al. [41] used MMD in D-WFA to measure feature-space distances between source and target domains, assigning larger weights to more similar clients, and Zhang et al. [60] further proposed generalization adjustment, which explicitly reweights the aggregation objective to promote domain generalization. However, statistical distances capture only distributional similarity and may miss deeper semantic relationships. Yuan et al. [61] therefore proposed CSAC, an attention-based mechanism that dynamically allocates cross-layer aggregation weights according to semantic similarity. Chen et al. [62] further introduced confidence and fairness as additional objectives in FedAWA, ensuring that aggregation not only maximizes transfer but also maintains reliability across unseen domains.
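A minimal sketch of distance-driven aggregation is given below. It assumes that each source client already has a scalar distance to the target domain (for example an MMD estimate) and maps distances to weights with a softmax; the softmax mapping and the temperature parameter are illustrative assumptions, not the weighting rule of D-WFA or the other cited methods.

```python
import numpy as np

def affinity_weights(distances, temperature=1.0):
    """Map source-target distances to aggregation weights (smaller distance, larger weight)."""
    logits = -np.asarray(distances, dtype=float) / temperature
    logits -= logits.max()  # numerical stability
    w = np.exp(logits)
    return w / w.sum()

def weighted_aggregate(client_params, weights):
    """Convex combination of client parameter vectors."""
    return sum(w * p for w, p in zip(weights, client_params))

# Hypothetical distances of three source clients to the target domain.
distances = [0.1, 0.5, 2.0]
weights = affinity_weights(distances, temperature=0.5)
params = [np.array([1.0, 0.0]), np.array([0.0, 1.0]), np.array([5.0, 5.0])]
target_init = weighted_aggregate(params, weights)
```

A lower temperature concentrates the weight on the most similar sources, while a high temperature approaches uniform averaging.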

3.4.2. Fast Adaptation via Federated Meta-Learning

In open environments, new nodes and tasks are the norm. Federated meta-learning formalizes the goal of finding an initialization from which the model adapts quickly to a shifted domain, i.e., the model learns how to learn. FedMeta [63] introduced MAML into federated learning by constructing support and query sets on clients to simulate domain adaptation episodes. This forces the global model to learn meta-knowledge so that it can quickly converge on new target domains with only a few gradient steps. Per-FedAvg [64] further established the theoretical link between meta-learning and personalization, showing that a suitable global initialization can be efficiently adapted to each client’s local optimum.
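For reference, the MAML-style personalization objective can be written as below. The notation is assumed here for illustration: $f_i$ is the local loss of client $i$, $n$ the number of clients, $\alpha$ the local adaptation step size, and $F_i(w) = f_i(w - \alpha \nabla f_i(w))$ the loss after one adaptation step.

```latex
\min_{w}\; F(w) = \frac{1}{n}\sum_{i=1}^{n} f_i\bigl(w - \alpha \nabla f_i(w)\bigr),
\qquad
\nabla F_i(w) = \bigl(I - \alpha \nabla^2 f_i(w)\bigr)\,\nabla f_i\bigl(w - \alpha \nabla f_i(w)\bigr).
```

The Hessian term $\nabla^2 f_i(w)$ in the gradient is the source of the second-order cost; first-order variants drop or approximate it.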

3.4.3. Knowledge Transfer via Federated Distillation

When participants use heterogeneous model architectures, parameter averaging breaks down completely. In this case, knowledge distillation offers an important path for crossing architectural gaps by transferring soft labels or intermediate responses instead of raw parameters. FedMD [65] proposed a heterogeneous federated distillation framework based on public datasets, where client predictions on shared public data are used as consensus signals. In scenarios without public data, FedGen [66] and FedFTG [67] introduce lightweight generators that synthesize pseudo-samples carrying global distribution information, enabling data-free knowledge transfer. To avoid negative transfer from indiscriminate distillation, the method in [68] identifies the importance of each teacher model for each sample and dynamically adjusts distillation weights according to relevance. The approach in [69] further proposes collaborative mutual distillation with knowledge filters that remove noisy predictions inconsistent with target-domain consensus.
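The consensus-and-distill pattern can be sketched in a few lines. The example assumes that every client has evaluated its own (possibly different) architecture on the same public samples and only shares the resulting logits; averaging temperature-softened probabilities is an illustrative choice and may differ from the exact consensus rule of FedMD or related methods.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    z = logits / temperature
    z = z - z.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def consensus_soft_labels(client_logits, temperature=1.0):
    """Average the softened predictions of all clients on shared public samples."""
    probs = [softmax(l, temperature) for l in client_logits]
    return np.mean(probs, axis=0)

def distillation_loss(student_logits, teacher_probs, temperature=1.0):
    """Mean KL(teacher || student) over the public samples."""
    student = softmax(student_logits, temperature)
    eps = 1e-12
    kl = teacher_probs * (np.log(teacher_probs + eps) - np.log(student + eps))
    return float(np.mean(np.sum(kl, axis=1)))

# Hypothetical logits of three clients on six public samples with four classes.
rng = np.random.default_rng(0)
client_logits = [rng.normal(size=(6, 4)) for _ in range(3)]
teacher = consensus_soft_labels(client_logits, temperature=2.0)

student_logits = rng.normal(size=(6, 4))
loss = distillation_loss(student_logits, teacher, temperature=2.0)
matched_loss = distillation_loss(np.log(teacher), teacher, temperature=1.0)
```

Only predictions cross the network, so the clients never need to agree on a parameter layout.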

Remark. Personalized aggregation suits scenarios where clients have heterogeneous data and the system needs adaptive knowledge fusion without architectural changes; its main challenge is computing similarity weights efficiently and privately. Federated meta-learning is most appropriate for dynamic edge scenarios with frequent cold starts requiring rapid few-shot adaptation; however, second-order derivatives make it computationally demanding on resource-constrained devices. Federated distillation best fits heterogeneous-architecture settings where parameter averaging is infeasible; it transfers dark knowledge through soft targets but depends on mediator data or generators prone to instability.

4. Federated Continual Learning: Mitigating Catastrophic Forgetting

In open and dynamic federated environments, as new data and new nodes continuously arrive, the model must absorb new knowledge through continual training, corresponding to scenarios ❶ and ❷ in Figure 1. This process is often accompanied by significant shifts in model parameters, which can severely reduce the global model’s memory of historical tasks and original nodes, namely catastrophic forgetting. Therefore, preventing catastrophic forgetting while learning new knowledge is one of the key challenges of model evolution in open environments. By combining continual learning with federated learning, federated continual learning has emerged as an important research direction for distributed settings. Unlike traditional centralized continual learning, federated continual learning faces several unique challenges:

Task heterogeneity: Different nodes operate in different environments, so the new data distributions they observe at the same time can differ substantially, leading to heterogeneous task evolution across clients.
Task asynchrony: Besides heterogeneous new-task distributions, the time when different nodes encounter new tasks may also differ, causing asynchronous task evolution across clients.

Current approaches to federated continual learning can be broadly categorized into alignment-based methods, rehearsal-based methods, architecture-based methods, and aggregation-based methods, as illustrated in Figure 4.

4.1. Alignment-Based Methods

In FCL, catastrophic forgetting is mainly caused by model drift [70,71]. As clients adapt to heterogeneous new tasks, feature extractors and decision boundaries shift away from the global consensus that supports historical tasks. Alignment-based methods mitigate this by enforcing consistency, compatibility, or orthogonality across knowledge from different sources and time steps. Existing solutions fall into three categories: feature-based alignment, gradient/parameter-based alignment, and output-based alignment.

4.1.1. Feature-Based Alignment

Feature-space alignment seeks to construct a shared and spatio-temporally robust feature space so that data from different nodes and different times maintain consistent geometry and topology. Prototypes are commonly used in FCL to mitigate catastrophic forgetting. For instance, FedTA [72] introduces a tail-anchor mechanism that keeps feature distributions close to initial global prototypes, thereby maintaining stable classification boundaries over time. FPPL [73] combines prompt learning with global prototypes by freezing the backbone, training only prompts, and using prototype constraints to calibrate the classifier, thus achieving parameter-efficient anti-forgetting. Unlike prototype methods that focus on class centers, contrastive learning explicitly preserves inter-class separability and intra-class compactness by pulling together positive pairs and pushing apart negative pairs. FedRCIL [74] extends this idea to incremental settings by aligning current models with historical-task models through contrastive losses. Hybrid alignment strategies that combine prototypes and contrastive learning have also emerged. For example, FedSpace [75] uses prototype-based distance losses to preserve old-class boundaries and contrastive representation losses to align feature distributions across old and new tasks.
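A minimal sketch of prototype sharing is shown below. It assumes clients send per-class feature sums and counts, that the server forms count-weighted global prototypes, and that a squared-distance penalty pulls local features toward them; the function names and the specific loss are illustrative assumptions rather than the formulation of any single cited method.

```python
import numpy as np

def local_prototypes(features, labels, num_classes):
    """Per-class feature sums and counts computed on one client."""
    sums = np.zeros((num_classes, features.shape[1]))
    counts = np.zeros(num_classes)
    for c in range(num_classes):
        mask = labels == c
        counts[c] = mask.sum()
        if counts[c] > 0:
            sums[c] = features[mask].sum(axis=0)
    return sums, counts

def global_prototypes(client_stats):
    """Count-weighted class means across clients (server side)."""
    total_sums = sum(s for s, _ in client_stats)
    total_counts = sum(c for _, c in client_stats)
    return total_sums / np.maximum(total_counts, 1)[:, None]

def prototype_loss(features, labels, prototypes):
    """Mean squared distance between features and their class prototype."""
    return float(np.mean(np.sum((features - prototypes[labels]) ** 2, axis=1)))

# Two hypothetical clients with 3 classes and 3-dimensional features.
rng = np.random.default_rng(0)
feats_a, labels_a = rng.normal(size=(30, 3)), np.arange(30) % 3
feats_b, labels_b = rng.normal(size=(50, 3)), np.arange(50) % 3

stats = [local_prototypes(feats_a, labels_a, 3), local_prototypes(feats_b, labels_b, 3)]
protos = global_prototypes(stats)
loss_a = prototype_loss(feats_a, labels_a, protos)
```

Exchanging sums and counts instead of raw features keeps the communicated payload independent of the local dataset size.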

4.1.2. Gradient- or Parameter-Based Alignment

While feature alignment focuses on intermediate representations, gradient- and parameter-based alignment directly acts on the weight space or optimization trajectories of neural networks, preventing parameters from deviating too much from already acquired knowledge structures. The most intuitive approach is important-parameter regularization, i.e., identifying parameters critical to old tasks and penalizing their drift during new training. FedCurv [76] introduced Fisher-based importance estimation into federated learning, and subsequent work reduced its cost by moving estimation to the server with a proxy dataset [77] or integrating local and global dynamics through personalized surrogate models [78]. However, as tasks accumulate, more parameters are locked and the model loses plasticity for new knowledge. Orthogonal gradient projection addresses this rigidity by constraining new-task updates to directions orthogonal to old-task gradients, thereby eliminating interference geometrically rather than suppressing it with penalties. DOLFIN [79] extracts dominant gradient subspaces via SVD at task boundaries for server-side aggregation, and FOT [80] further constructs a global projection matrix to overcome the local-view limitation. This principle has also been combined with prompt tuning [81] and frequency-domain separation [82]. Complementary to these constraint-based strategies, gradient correction methods take a different angle by directly adjusting update directions to offset client drift caused by non-IID data. FedAGC [83] pushes local updates toward the global Pareto frontier through asymmetric correction, while STAMP [84] designs spatio-temporal matching that approximates historical gradients via coresets at the client side and aligns cross-client gradients at the server side.
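The orthogonal projection principle can be sketched as follows. The example assumes that flattened old-task gradients are stacked row-wise, extracts a dominant subspace by SVD, and removes the component of a new-task gradient inside that subspace. The rank and the synthetic gradients are illustrative assumptions, and the sketch omits how the cited methods build or aggregate the subspace across clients.

```python
import numpy as np

def dominant_subspace(old_gradients, rank):
    """Orthonormal basis (dim, rank) of the top right-singular directions."""
    _, _, vt = np.linalg.svd(old_gradients, full_matrices=False)
    return vt[:rank].T

def project_orthogonal(grad, basis):
    """Remove the component of grad that lies in the span of basis."""
    return grad - basis @ (basis.T @ grad)

# Hypothetical flattened gradients collected at a task boundary.
rng = np.random.default_rng(0)
old_grads = rng.normal(size=(20, 8))
basis = dominant_subspace(old_grads, rank=3)

new_grad = rng.normal(size=8)
projected = project_orthogonal(new_grad, basis)
```

The projected update no longer moves parameters along the protected directions, which is the geometric sense in which interference is removed; the price is a smaller effective update as the protected subspace grows.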

4.1.3. Output-Space Alignment

Unlike parameter or feature alignment, output-based alignment focuses directly on the model’s final behavior, namely its predictive probability distribution on a given input. The core idea is to construct a teacher-student paradigm in which an old model or a global model serves as the teacher and provides soft targets to guide the current model, so that the student preserves response patterns on old tasks while learning new ones. The theoretical foundation of this class of methods can be traced to LwF [85]. In federated environments, the first challenge is how to choose teachers. FLwF-2T [86] uses both a local old model and a global server model as teachers. The former preserves local history, while the latter transfers globally shared knowledge. FedMTL [87] similarly uses adaptive multi-teacher knowledge distillation to enhance cross-task knowledge fusion. Since these methods rely on new data to transfer old knowledge, they can be insufficient in class-incremental scenarios. To address this issue, some methods use unlabeled public data as bridges, such as MUFTI [88], while others use experience replay or generative replay to support distillation, such as FedCLASS [89] and GLFC [90]. More recently, dual-end collaborative distillation has been explored: CFeD [91] deploys separate proxy datasets on both the client and server sides to perform bidirectional knowledge distillation.

Remark. Feature-based alignment suits scenarios where explicit geometric constraints in embedding space are feasible; it provides strong anti-forgetting through representation-level regularization but requires additional memory and careful tuning. Gradient- or parameter-based alignment is more appropriate when client drift must be directly suppressed at the optimization level; it steers local paths toward global consistency but may require extra bandwidth and risk privacy leakage. Output-space alignment fits heterogeneous settings where only soft predictions can be exchanged; it tolerates model heterogeneity well but depends on teacher quality and struggles under low bandwidth.

4.2. Rehearsal-Based Methods

A straightforward way to alleviate catastrophic forgetting is rehearsal, namely allowing the model to revisit historical knowledge through some mechanism so as to maintain memory of past data distributions. These methods must carefully balance storage overhead, computation, and privacy. According to how historical knowledge is represented, they can be divided into experience replay and generative replay.

4.2.1. Experience Replay

The idea of rehearsal can be traced back to early studies such as those by Robins [92], and later became a mainstream continual learning paradigm through experience replay [93]. When moved to FCL, the research focus shifted from what to store to how to select efficiently under privacy and storage constraints. Good et al. [94] proposed a coordinated selection method based on gradient diversity, formulating sample selection as a distributed optimization problem to select replay subsets that best support global optimization under strict storage budgets. Re-Fed [95] introduces a personalized information model and uses gradient norms as importance indicators to select historical samples most beneficial for global optimization. For domain-incremental settings, SR-FDIL [96] combines local prototype distances and global discriminator scores to select representative replay samples under heterogeneous domains. Online FCL makes sample selection even more challenging because data arrive in streams and can be processed only once. Serra et al. [97] proposed Bregman-information-based sample selection using test-time augmentation to prioritize informative samples in memory. To avoid storing raw private data, researchers have also explored non-raw replay forms. For example, FedPMR [98] combines replay with distillation by storing logits of sampled historical data.
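Importance-driven replay selection under a storage budget can be sketched as follows. The example assumes a linear model with squared loss so that per-sample gradient norms have a closed form, and it simply keeps the highest-scoring samples; this is a simplified stand-in for the scoring rules of the cited methods, not a reimplementation of them.

```python
import numpy as np

def per_sample_grad_norms(x, y, w):
    """Gradient norm of 0.5 * (x_i . w - y_i)^2 with respect to w, per sample."""
    residual = x @ w - y
    return np.abs(residual) * np.linalg.norm(x, axis=1)

def select_replay_indices(scores, budget):
    """Indices of the highest-scoring samples that fit in the replay buffer."""
    budget = min(budget, len(scores))
    return np.argsort(scores)[::-1][:budget]

# Hypothetical local data from a finished task and the current model weights.
rng = np.random.default_rng(0)
x = rng.normal(size=(40, 5))
y = x @ rng.normal(size=5) + 0.1 * rng.normal(size=40)
w = np.zeros(5)

scores = per_sample_grad_norms(x, y, w)
replay_idx = select_replay_indices(scores, budget=8)
replay_x, replay_y = x[replay_idx], y[replay_idx]
```

Pure top-k selection ignores diversity; methods that optimize for gradient diversity or representativeness address exactly that gap.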

4.2.2. Generative Replay

Because storing real user data may violate strict privacy requirements, generative replay trains generators to synthesize pseudo-samples resembling historical data distributions so that replay can proceed without raw data. The first wave of methods relied on GANs [99]. FedCIL [100] deploys client-side GANs to generate old-class data, but training a generator on every resource-constrained device is expensive. To shift this burden, MFCL [101] moves generator training to the server, using BN statistics from the global model, and broadcasts the trained generator back to clients, an idea shared by GLFC [90] and TARGET [102]. This server-centric paradigm has since been extended to streaming spatio-temporal tasks [103] and graph data with topological structure [104]. However, GANs suffer from training instability and mode collapse, motivating a second wave that bypasses explicit generators altogether. FedProk [105] generates pseudo-features directly from class prototypes, FCLPF [106] synthesizes old-class features via principal-component-based geometric offsets, and HGP [107] models feature distributions through hierarchical Gaussian prototypes for server-side classifier rebalancing. These prototype-based approaches are lightweight but limited in diversity. A third wave leverages the superior generation quality of diffusion models. Rather than training generators from scratch, diffusion-driven replay [108] searches for optimal prompts in pretrained latent diffusion models. DCFL [109] first explored conditional diffusion for replay, and FedDCL [110] combines pretrained diffusion models with dual-end distillation to further improve fidelity. Finally, recent work questions whether all old knowledge should be replayed indiscriminately. AF-FCL [111] uses normalizing flows and density-based weighting to selectively suppress low-density generated features that may correspond to noise, and CAN [112] exploits client expertise to guide generator training without relying on expensive external pretrained models.
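The generator-free, prototype-based line of work can be illustrated with a small sketch that samples pseudo-features from a diagonal Gaussian around each stored class prototype. The diagonal-Gaussian assumption, the function name, and the example statistics are illustrative; the cited methods use their own, more elaborate constructions.

```python
import numpy as np

def sample_pseudo_features(prototypes, variances, per_class, rng):
    """Draw per_class pseudo-features for every old class from N(prototype, diag(variance))."""
    feats, labels = [], []
    for c, (mu, var) in enumerate(zip(prototypes, variances)):
        feats.append(rng.normal(mu, np.sqrt(var), size=(per_class, mu.shape[0])))
        labels.append(np.full(per_class, c))
    return np.vstack(feats), np.concatenate(labels)

# Hypothetical stored statistics for two old classes in a 3-dimensional feature space.
prototypes = np.array([[0.0, 1.0, -1.0], [2.0, 2.0, 2.0]])
variances = np.full((2, 3), 0.01)

rng = np.random.default_rng(0)
pseudo_x, pseudo_y = sample_pseudo_features(prototypes, variances, per_class=2000, rng=rng)
```

The pseudo-features can then be mixed with new-task features when retraining the classifier. The limited diversity noted above is visible here: every sample is a perturbation of a single class center.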

Remark. Experience replay suits scenarios where storing a small subset of historical data is permissible and high replay fidelity is prioritized; it is highly effective but may violate privacy requirements and faces buffer saturation as tasks accumulate. Generative replay is more appropriate when strict privacy regulations prohibit raw-data storage; it offers in principle unbounded memory without storing real samples, but incurs high computational cost and risks hallucinated or low-quality synthetic samples. Neither alone satisfies all constraints: experience replay is limited by privacy and storage, while generative replay is limited by generation quality and computation.

4.3. Architecture-Based Methods

Whether alignment-based or replay-based, most methods still work under fixed model capacity. Architecture-based methods instead dynamically adjust model topology or isolate parameters so that different tasks occupy different physical representation spaces, fundamentally bypassing interference caused by parameter competition. Depending on whether the network physically grows during training, these methods can be divided into fixed-architecture and dynamic-architecture approaches.

4.3.1. Fixed Architecture

Fixed-architecture methods keep the network size unchanged but isolate tasks through masks, decomposition, or modular structures, with isolation granularity progressing from coarse to fine.

The earliest approach is subnetwork allocation through pruning. PackNet [113] pioneered this by iteratively pruning and assigning non-overlapping parameter subsets to different tasks. TagFed [114] extends this to FCL by pruning large models into task-specific subnetworks and using feature tracing to detect repeated tasks. FedFRR [115] softens the hard pruning boundary by decomposing the model into weighted mixtures of primitive network units and freezing those whose weights stabilize during aggregation. Rather than carving subnetworks from a monolithic model, a second direction explicitly decouples shared and task-specific modules. FedWeIT [116] decomposes weights into globally shared base parameters and sparse task-specific parameters and further uses attention to selectively borrow task knowledge from other clients. SacFL [3] separates the model into a robust encoder shared across tasks and lightweight task-sensitive decoders. With the rise of large pretrained models, prompt-based methods offer even finer-grained isolation by keeping the entire backbone frozen. FedMGP [117] uses global prompts for shared knowledge and local prompts for personalized knowledge, while Fed-CPrompt [118] introduces asynchronous prompt learning to support clients at different progress stages.
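Subnetwork allocation through masks can be sketched as follows. The example assumes a single weight tensor and assigns each new task the largest-magnitude weights that are still free, so task masks never overlap. It omits the retraining step that follows pruning in PackNet-style pipelines, and the keep fraction is a hypothetical parameter.

```python
import numpy as np

def allocate_task_mask(weights, free_mask, keep_fraction):
    """Assign the largest-magnitude free weights to a new task and return its mask."""
    free_idx = np.flatnonzero(free_mask)
    n_keep = int(round(keep_fraction * free_idx.size))
    order = np.argsort(np.abs(weights.ravel()[free_idx]))[::-1]
    chosen = free_idx[order[:n_keep]]
    task_mask = np.zeros(weights.size, dtype=bool)
    task_mask[chosen] = True
    return task_mask.reshape(weights.shape)

# Hypothetical layer with 20 weights shared by three sequential tasks.
rng = np.random.default_rng(0)
weights = rng.normal(size=(4, 5))
free = np.ones_like(weights, dtype=bool)

masks = []
for _ in range(3):
    mask = allocate_task_mask(weights, free, keep_fraction=0.5)
    masks.append(mask)
    free &= ~mask  # weights claimed by a task are frozen for later tasks
```

Each task claims half of what remains, so the free capacity shrinks geometrically, which is the parameter-space exhaustion that fixed-architecture methods face as tasks accumulate.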

4.3.2. Dynamic Architecture

To avoid negative transfer caused by forced sharing in fixed backbones, dynamic-architecture methods physically add neurons, layers, or task-specific components so that old and new knowledge are isolated in different substructures. A direct strategy is to add new computation units as tasks increase. Venkatesha et al. [70] proposed a frozen-backbone plus inserted-layer strategy, using NetTailor to freeze a generic backbone and inserting proxy layers for individual tasks. Chen et al. [119] used multi-head architectures with one fully connected head per incremental task. FedCBC [120] trains one VAE per class and composes them modularly for selective knowledge fusion. Another effective route is to dynamically generate parameters using hypernetworks. FedDAH [121] introduces a dynamic allocation hypernetwork that maps task identities to model parameters, allowing task-specific parameter generation for asynchronous heterogeneous tasks across clients.

Remark. Fixed-architecture methods suit resource-constrained edge devices where model size must remain constant; they eliminate forgetting through parameter isolation but may impede positive transfer and face parameter-space exhaustion as tasks accumulate. Dynamic-architecture methods are more appropriate when the system can afford growing capacity and tasks exhibit strong mutual interference; they effectively eliminate interference by adding task-specific components but suffer from linear model-size growth that challenges scalability on edge devices. Neither alone resolves the tension between complete task isolation and efficient cross-task knowledge sharing.

4.4. Aggregation-Based Methods

Aggregation-based methods focus on server-side fusion strategies for combining heterogeneous incremental knowledge from different nodes. Unlike standard federated learning, where aggregation is usually simple parameter averaging, FCL aggregation must simultaneously handle spatial heterogeneity across clients and temporal asynchrony across tasks. Existing work mainly falls into optimization-based weighted aggregation and ensemble/analytical aggregation.

4.4.1. Optimization-Based Weighted Aggregation

This family of methods retains the parameter-aggregation paradigm but dynamically optimizes client selection and aggregation weights. FedAtt [122] uses adaptive layer-wise attention to compute distances between local and server models at each layer and thereby quantify client reliability. Clients whose models deviate strongly from the global distribution center receive lower weights. To handle extremely skewed distributions, some methods replace a single global model with clustered concept-specific models. Concept Matching [123] clusters client gradient streams and routes them to different concept models maintained at the server. In prompt-based large-model settings, Powder [4] introduces task-correlation matrices to quantify similarities across tasks and clients and aggregates only lightweight prompts accordingly.

4.4.2. Model Ensembling and Analytical Aggregation

To move beyond the theoretical limitations of parameter averaging, some studies explore model pools and analytical closed-form solutions. Ensemble-learning ideas have been introduced into FCL. LFedCon2 [124] maintains a global ensemble of the best local models at the server and selects them online through voting instead of averaging parameters directly. Other work seeks to bypass gradient-based optimization entirely. LoRM [125] formulates aggregation under parameter-efficient fine-tuning as a linear regression problem based on Gram matrices and solves it in closed form. AFCL [126] goes even further and derives gradient-free analytical updates using frozen pretrained features and recursive global aggregation, proving spatio-temporal invariance under certain conditions. In serverless settings, aggregation can even become topology-based knowledge diffusion, as in FedPC [127], which allows peers to exchange weights directly in vehicular networks.
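The appeal of closed-form aggregation can be seen in a ridge-regression sketch. It assumes that clients extract features with a frozen backbone and send only the Gram matrix and the feature-target cross term; the server sums these statistics and solves one linear system. This illustrates the general principle and is not the exact algorithm of LoRM or AFCL; the ridge coefficient and all names are assumptions.

```python
import numpy as np

def client_statistics(features, targets):
    """Sufficient statistics for least squares: Gram matrix and cross term."""
    return features.T @ features, features.T @ targets

def analytic_aggregate(stats, ridge=1e-3):
    """Solve (sum_i G_i + ridge * I) W = sum_i C_i in closed form."""
    gram = sum(g for g, _ in stats)
    cross = sum(c for _, c in stats)
    return np.linalg.solve(gram + ridge * np.eye(gram.shape[0]), cross)

# Three hypothetical clients with frozen 6-dimensional features and 2 outputs.
rng = np.random.default_rng(0)
client_data = [(rng.normal(size=(n, 6)), rng.normal(size=(n, 2))) for n in (15, 40, 25)]

stats = [client_statistics(x, y) for x, y in client_data]
w_federated = analytic_aggregate(stats, ridge=1e-3)

# Centralized reference computed on pooled data, for comparison only.
x_all = np.vstack([x for x, _ in client_data])
y_all = np.vstack([y for _, y in client_data])
w_central = np.linalg.solve(x_all.T @ x_all + 1e-3 * np.eye(6), x_all.T @ y_all)
```

Because the statistics are additive, the result does not depend on how the data are split across clients or on the order in which they arrive, which is the intuition behind the invariance claims of analytical approaches.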

Remark. Optimization-based weighted aggregation suits scenarios where a FedAvg-style framework is already deployed and heterogeneous distributions require adaptive weighting; it reduces gradient conflicts and integrates easily, but still relies on linear combinations that may fail under severe geometric conflicts. Model ensembling and analytical aggregation are more appropriate when parameter averaging is fundamentally inadequate; they substantially improve anti-forgetting ability but incur high storage and inference overhead or expensive matrix computation. Neither alone resolves the tension between lightweight deployment and strong anti-forgetting guarantees.

5. Federated Unlearning: Selective Knowledge Removal

In federated learning systems operating in open environments, models do not only accumulate knowledge in one direction. Knowledge removal is just as important as knowledge acquisition. In practice, due to malicious-node detection, data correction, user privacy withdrawal such as the right to be forgotten, and intellectual-property compliance, some historical data, samples, classes, or even entire client nodes must be effectively removed from the global model, corresponding to scenarios ❸ and ❹ in Figure 1. This need has led to the rapidly growing field of federated unlearning, whose objective is to make the global model behave approximately as if the designated data had never been learned, while still respecting federated privacy constraints and system efficiency. Distributed architectures bring unique challenges to unlearning:

Node unlearning. Unlike centralized unlearning, which usually focuses on samples or classes, distributed settings must additionally address node-level unlearning, i.e., removing the influence of an entire client’s data from the global model.
Contribution entanglement. In FL, individual data contributions are embedded within aggregated parameters through iterative averaging, making it difficult to isolate specific data or client influence. Unlike centralized unlearning, federated unlearning must disentangle target contributions without accessing raw data, creating tension between unlearning completeness and privacy constraints.

Therefore, federated unlearning is not a simple extension of centralized unlearning. It requires systematic redesigns of training mechanisms and parameter-updating processes under unique constraints. Existing methods can be broadly grouped into three categories: retraining-based methods, model-adjustment-based methods, and contribution-reversal-based methods, as illustrated in Figure 5.

5.1. Retraining-Based Methods

Retraining-based methods are the most traditional and semantically strict family in federated unlearning. Their basic idea is that once an unlearning request is received, the model is retrained so that all influence of the forgotten data on model parameters is completely removed. According to the retraining scope, these methods can be divided into full retraining and partial retraining.

5.1.1. Full Retraining

Full retraining is the most direct and conceptually strict unlearning strategy. After receiving an unlearning request, the system completely removes the contribution of the target data or client and retrains the entire model using only the retained data, producing a model statistically equivalent to one that has never seen the forgotten data. Since all model parameters are relearned from the retained data, this strategy provides the strongest theoretical guarantee of unlearning and ensures that no residual influence remains. It was first formalized as a basic strategy in machine unlearning [128] and is widely regarded as the gold standard for exact forgetting. However, despite its strict guarantees, full retraining is prohibitively expensive in federated settings because all remaining clients must repeat the full local-training and communication process [129]. In practice, client availability and willingness to participate in repeated retraining may be limited, so full retraining often fails to meet real system availability requirements. Nevertheless, it remains an ideal upper bound and a baseline for evaluating more efficient unlearning methods. To reduce the burden of full retraining, several accelerated variants have been proposed. Liu et al. [130] introduced a rapid retraining approach based on quasi-Newton approximations, diagonal empirical Fisher information, and momentum correction to quickly converge along the trajectory induced by deleting target data. Halimi et al. [131] proposed a client-removal method in which local unlearning is first performed on the target client, and the resulting model is then used as initialization for a small number of federated rounds with the remaining clients. Zhang et al. [132] proposed BMT and MMT, which maintain local models in parallel with the global model so that, upon an unlearning request, the server can discard the current global model and update from the remaining local models without restarting from scratch.

5.1.2. Partial Retraining

Partial retraining avoids retraining all clients and instead retrains only affected model components, data shards, or a limited subset of participating clients. This significantly reduces both training and communication overhead while preserving unlearning semantics. The key insight is that if training can be organized so that contributions are partially isolated, then forgetting can be localized to the affected partition alone.

The foundational idea comes from centralized machine unlearning: Bourtoule et al. [128] proposed SISA, which partitions training data into independent shards and retrains only the affected shard upon deletion. Translating this to federated settings requires defining what constitutes a “shard” in a distributed system. Pan et al. [133] answer this for federated clustering by designing decomposable statistics so that deleting one client only requires updating the affected cluster, and Liu et al. [134] further strengthen this with SecAgg+ for privacy under adversarial users. Moving beyond statistical decomposability, Su and Li [135] proposed Knot, which clusters clients by training speed and update characteristics so that forgetting one client requires retraining only its cluster while others continue asynchronously. Lin et al. [136] push isolation further through coded computation: only the shard containing the forgotten client is retrained, while remaining shards are recovered using coding redundancy. In vertical federated learning, the natural feature partitioning across parties provides an even finer isolation boundary: Wang et al. [137] exploit this so that only parties holding relevant features need to retrain.
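The shard-isolation idea can be sketched as follows. This is a minimal illustration, not the SISA algorithm or any of the federated methods cited above: "training" is replaced by a toy mean estimator, each client is assumed to belong to exactly one shard, and the class `ShardedEnsemble` and its methods are hypothetical names.

```python
from typing import Dict, List


def train_shard(shard: Dict[str, List[float]]) -> float:
    """Toy stand-in for training: the shard model is the mean of its samples."""
    samples = [x for data in shard.values() for x in data]
    return sum(samples) / len(samples)


class ShardedEnsemble:
    """Keeps one model per shard so that deletion touches a single shard."""

    def __init__(self, shards: List[Dict[str, List[float]]]) -> None:
        self.shards = [dict(s) for s in shards]
        self.models = [train_shard(s) for s in self.shards]
        self.retrained = 0  # number of shard retrainings triggered by unlearning

    def predict(self) -> float:
        """Aggregate the shard models by simple averaging."""
        return sum(self.models) / len(self.models)

    def unlearn_client(self, client_id: str) -> None:
        for i, shard in enumerate(self.shards):
            if client_id not in shard:
                continue
            del shard[client_id]
            if shard:
                self.models[i] = train_shard(shard)  # retrain only this shard
                self.retrained += 1
            else:
                # The shard is now empty: drop it instead of retraining.
                del self.shards[i]
                del self.models[i]
            return
        raise KeyError(client_id)


ens = ShardedEnsemble([{"a": [1.0, 3.0], "b": [5.0]}, {"c": [10.0], "d": [20.0]}])
ens.unlearn_client("b")  # only the first shard is retrained
```

The second shard's model is untouched by the deletion, which is where the savings come from; the price is that the aggregate of isolated shard models may be weaker than a jointly trained model.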

Remark. Full retraining suits scenarios requiring strict theoretical guarantees of forgetting correctness; it provides the strongest unlearning semantics but is prohibitively expensive in federated environments, confining it mostly to an ideal baseline. Partial retraining is better suited to large-scale systems where only a subset of clients or components is affected; it localizes unlearning effects and greatly reduces overhead, but its effectiveness depends on the independence of subunits and degrades as inter-subunit coupling increases. Neither alone satisfies all deployment constraints, motivating more lightweight approaches.

5.2. Model Adjustment-Based Methods

Unlike retraining-based methods, model-adjustment-based methods try to weaken, cancel, or erase the influence of forgotten data by modifying model parameters or structures without, or with only minimal, retraining. They can be divided into parameter-oriented and structure-oriented strategies.

5.2.1. Parameter-Oriented Methods

Parameter-oriented methods directly weaken, cancel, or reconstruct the contribution of forgotten data in parameter space through explicit update rules, gradient reversal, importance assessment, or contribution modeling. Early parameter-oriented methods mainly focused on directly suppressing historical gradient traces. Zhao et al. [138] proposed momentum degradation (MoDe), which weakens the accumulated influence of forgotten clients in the global model by adjusting momentum terms and introducing an auxiliary randomly initialized model trained on retained data. Later methods moved toward explicit contribution estimation. FedRecovery [139] quantifies the cumulative influence of target clients through gradient residuals and removes them from parameter space with calibrated Gaussian noise for statistical indistinguishability. VERIFI [140] additionally introduces verification by suppressing target gradients during aggregation and marking model states so that forgetting can be audited. More fine-grained methods include FFMU [141], which smooths local gradients through nonlinear function approximation; ConDA [142], which tracks parameter contributions from each client and selectively weakens sensitive updates; and SFU [143], which removes target influence by projecting affected gradients into orthogonal subspaces.

Because many of these approaches require servers to store complete historical updates, a key practical challenge is memory efficiency. Three complementary strategies have emerged to address this bottleneck at different levels. At the storage level, MEDU [144] compresses client updates hierarchically during training and reconstructs approximate histories on demand for later forgetting. At the representation level, FERRARI [145] sidesteps historical storage entirely by minimizing feature sensitivity associated with target data, achieving forgetting through representation-level optimization rather than gradient rollback. At the structural level, FedSSU [146] explores channel-level selective updates that confine forgetting operations to relevant network channels. Orthogonally, Starfish [147] addresses the privacy dimension by combining two-party computation with parameter calibration, ensuring that the forgetting process itself does not leak additional information.
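The orthogonal-subspace idea mentioned for parameter-oriented methods can be illustrated with a generic projection step. The sketch below is not the SFU algorithm itself; it only assumes that the server has some direction vectors representing retained clients (how these are obtained is method-specific) and removes their span from a candidate unlearning update. The names `orthonormal_basis` and `project_out` are hypothetical.

```python
from typing import List

Vector = List[float]


def dot(u: Vector, v: Vector) -> float:
    return sum(a * b for a, b in zip(u, v))


def orthonormal_basis(vectors: List[Vector], tol: float = 1e-12) -> List[Vector]:
    """Gram-Schmidt: orthonormal basis of the span of `vectors`."""
    basis: List[Vector] = []
    for v in vectors:
        w = list(v)
        for b in basis:
            c = dot(w, b)
            w = [wi - c * bi for wi, bi in zip(w, b)]
        norm = dot(w, w) ** 0.5
        if norm > tol:  # skip directions already covered by the basis
            basis.append([wi / norm for wi in w])
    return basis


def project_out(update: Vector, retained_dirs: List[Vector]) -> Vector:
    """Remove from `update` every component lying in span(retained_dirs)."""
    result = list(update)
    for b in orthonormal_basis(retained_dirs):
        c = dot(result, b)
        result = [ri - c * bi for ri, bi in zip(result, b)]
    return result


step = project_out([1.0, 2.0, 3.0], [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]])
```

After projection, the step has no component along the retained directions, so to first order it does not change the model along those directions.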

5.2.2. Structure-Oriented Methods

Structure-oriented methods achieve forgetting by modifying the model structure itself. Their key idea is to identify and remove structural units inside the model that carry information about the forgotten data, such as channels, feature dimensions, memory modules, or representation subspaces. Early work mainly mapped forgetting to targeted removal or erasure of channels and modules. Wang et al. [148] proposed class-discriminative pruning, which quantifies channel importance for target classes through activation patterns and prunes the most class-discriminative channels. Xia et al. [149] proposed FedME2, which explicitly projects model memory onto multi-layer feature maps and then performs controlled erasure guided by memory evaluation and similarity constraints. Structure-oriented ideas have also been generalized beyond CNNs. FedLU [150] handles federated knowledge graph embeddings by defining structural units as entity–relation embeddings and then applying retroactive interference and passive decay to erase forgotten triples while recovering retained knowledge.

More recent studies operate at an even finer granularity. Pan et al. [151] proposed feature-level forgetting in vertical federated learning by estimating the sensitivity of each feature to model loss and then combining weighted gradient updates, adaptive noise scaling, and gradient projection. Han et al. [152] treated forgotten data as potential backdoor sources and isolated suspicious feature interactions. FedWiper [153] partitions the global model into multiple submodels and uses general adapters so that forgetting can be localized to only the relevant submodel. FUSED [154] takes this idea further by introducing sparse adapters into sensitive layers. In this way, forgetting becomes the removal of plug-and-play components rather than modification of the whole backbone.

Remark. Parameter-oriented methods suit settings needing efficient forgetting without full retraining; they provide fast approximate forgetting through gradient reversal or contribution estimation, but accuracy depends on stored histories and errors may accumulate under strong non-IID conditions. Structure-oriented methods are more appropriate when forgetting must be interpretable and localizable to specific channels, features, or subspaces; they make forgetting modular and transparent, but their success depends on how well structural units correspond to data semantics. Neither alone handles all scenarios: parameter methods struggle with fine-grained semantic isolation, while structure methods face challenges when knowledge is deeply distributed across the network.

5.3. Contribution-Reversal-Based Methods

Contribution-reversal-based methods infer how target data contributed to the training dynamics or final model behavior and then construct inverse contributions to cancel that effect. Depending on how the inverse contribution is realized, these methods can be divided into replay, update reversal, and generative reconstruction.

5.3.1. Replay

Replay-based federated unlearning replays or partially replays historical training behavior so as to overwrite or cancel the influence of forgotten data. Instead of directly tracing parameter contributions, these methods simulate alternative training trajectories that approximate a model trained without the target data. The central tension is between replay fidelity and storage cost: caching more history enables more faithful trajectory reconstruction but imposes prohibitive memory overhead.

FedEraser [155] pioneered this direction by caching all client updates during training so that, during unlearning, the server can replay the remaining clients’ gradient trajectories and reconstruct an equivalent history without the forgotten participant. FAST [156] reduces storage by selectively replaying only server-side aggregated updates, complemented by a small sample set for post-unlearning recovery. QuickDrop [157] pushes compression further by distilling client gradient contributions into lightweight synthetic data through gradient matching, so that replay requires only a few gradient steps on these surrogates. When even compressed histories are unavailable, FedCF [158] takes a reconstruction approach, extracting effective knowledge from remaining-client updates and building temporary local models to fill in for removed contributions.

An alternative design pushes replay to the client side, removing the server’s storage burden entirely. FRU [159] stores selected historical updates locally, while FedUNRAN [160] takes a more radical approach by locally retraining with random labels to induce incorrect memory, thereby achieving forgetting without any faithful history at all. Application-specific recovery variants such as CGKD [161] further tailor client-side replay to domain needs.
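The storage-versus-fidelity trade-off of server-side replay can be seen in a deliberately naive sketch. It assumes the server cached every client's per-round update and simply re-aggregates the rounds without the target client. The function `replay_without` and its arguments are hypothetical, and the sketch omits the calibration or recovery steps that methods such as FedEraser add to correct for the fact that cached updates were computed from models already influenced by the target.

```python
from typing import Dict, List, Optional

Update = List[float]


def replay_without(
    w0: List[float],
    cached: List[Dict[str, Update]],
    weights: Dict[str, float],
    target: Optional[str],
) -> List[float]:
    """Re-aggregate cached per-round updates, skipping the target client."""
    w = list(w0)
    for round_updates in cached:
        kept = {c: u for c, u in round_updates.items() if c != target}
        if not kept:
            continue  # nobody left in this round, so the model is unchanged
        total = sum(weights[c] for c in kept)
        for i in range(len(w)):
            w[i] += sum(weights[c] / total * u[i] for c, u in kept.items())
    return w


cached = [{"a": [1.0], "b": [3.0]}, {"a": [1.0], "b": [1.0]}]
weights = {"a": 1.0, "b": 1.0}
w_original = replay_without([0.0], cached, weights, target=None)
w_unlearned = replay_without([0.0], cached, weights, target="b")
```

The cache grows with the number of rounds times the number of clients, which is the overhead that selective or compressed replay aims to reduce.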

5.3.2. Update Reversal

Update-reversal methods directly construct parameter updates opposite to the contribution of forgotten data or forgotten clients so as to cancel their influence on the current model. Early work studied this idea under simplified assumptions. F2L2 [162] focuses on linear models and convex optimization, characterizing the effect of samples or clients analytically and applying exact parameter corrections. Exact-FUN [163] generalized this idea to broader federated settings and formalized conditions under which corrected models can match retraining on retained data. From a Bayesian perspective, BFU [164] treats unlearning as posterior correction under distribution-consistency principles. FATS [165] formulates exact federated unlearning through total-variation stability, enabling theoretically guaranteed sample- and client-level forgetting.

For deep models and realistic federated systems, FedU [166] approximates user-side influence functions and explicitly balances utility preservation with influence cancellation. SIFU [167] further introduces sequential informed unlearning by identifying training points whose contributions exceed a forgetting budget and then applying inverse corrections. To reduce storage cost, Fast-FedUL [168] and CFRU [169] selectively store key historical gradients and construct approximate inverse corrections with theoretical bias bounds. Update-reversal methods have also been extended to vertical federated learning, such as ICO [170], which explicitly constructs inverse correction terms for feature-collaboration settings.
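For the linear, convex case, an exact inverse correction can be written in closed form. The sketch below is a textbook illustration rather than the procedure of any cited method: it assumes a one-dimensional least-squares model y = w * x whose solution depends only on two aggregated sufficient statistics, so a client's contribution can be subtracted exactly. All function names are hypothetical.

```python
from typing import Dict, List, Tuple

Sample = Tuple[float, float]  # (x, y)


def client_stats(data: List[Sample]) -> Tuple[float, float]:
    """Return (sum of x*y, sum of x*x) for one client's data."""
    return (sum(x * y for x, y in data), sum(x * x for x, _ in data))


def fit(sxy: float, sxx: float) -> float:
    """Closed-form least-squares slope for y = w * x."""
    return sxy / sxx


def aggregate(clients: Dict[str, List[Sample]]) -> Tuple[float, float]:
    """Server-side sum of the per-client sufficient statistics."""
    stats = [client_stats(d) for d in clients.values()]
    return (sum(s[0] for s in stats), sum(s[1] for s in stats))


def reverse_client(sxy: float, sxx: float, data: List[Sample]) -> Tuple[float, float]:
    """Subtract one client's contribution from the aggregated statistics."""
    cxy, cxx = client_stats(data)
    return (sxy - cxy, sxx - cxx)


clients = {"a": [(1.0, 2.0), (2.0, 4.0)], "b": [(1.0, 3.0)], "c": [(3.0, 0.0)]}
sxy, sxx = aggregate(clients)
w_corrected = fit(*reverse_client(sxy, sxx, clients["c"]))
w_retrained = fit(*aggregate({k: v for k, v in clients.items() if k != "c"}))
```

Here `w_corrected` and `w_retrained` coincide because the model is linear in its sufficient statistics. For deep models no such exact decomposition is available, which is why the methods in this paragraph rely on approximate influence estimates and stored gradients.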

5.3.3. Generative Reconstruction

Generative-reconstruction-based methods use synthetic data, pseudo-labels, soft outputs, or representation transformations to reconstruct model behavior and gradually remove the influence of forgotten data without explicitly replaying historical updates. A foundational federated unlearning framework based on knowledge distillation was proposed by Wu et al. [129]. Later work explored more proactive reconstruction of pseudo memories. FedAF [171] generates semantically meaningless signals and combines them with teacher–student distillation so that new signals overwrite old memories. Similar ideas have been used for class-level forgetting [172]. FedBT [173] introduces a teacher that excludes target-client knowledge and uses orthogonal gradient constraints to guide the global model away from forgotten knowledge. RAFU [174] frames unlearning as selective knowledge reconstruction and uses reverse distillation together with simultaneous unlearning and filtering.
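Several of the methods above rely on teacher-student distillation. A minimal sketch of the standard temperature-scaled distillation loss is given below; which model acts as teacher, which data the loss is evaluated on, and how it is combined with other terms differ per method and are not specified here. The temperature `T = 2.0` and the function names are illustrative assumptions.

```python
import math
from typing import List


def softmax(logits: List[float], T: float = 1.0) -> List[float]:
    """Temperature-scaled softmax, shifted by the maximum for stability."""
    scaled = [l / T for l in logits]
    m = max(scaled)
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def kd_loss(student_logits: List[float], teacher_logits: List[float], T: float = 2.0) -> float:
    """KL(teacher || student) on softened outputs, scaled by T^2."""
    p = softmax(teacher_logits, T)
    q = softmax(student_logits, T)
    return T * T * sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


same = kd_loss([1.0, 2.0, 3.0], [1.0, 2.0, 3.0])   # student matches teacher
diff = kd_loss([3.0, 2.0, 1.0], [1.0, 2.0, 3.0])   # student disagrees
```

Minimizing this loss pulls the student's output distribution toward the teacher's, so the unlearning effect comes entirely from what the teacher does or does not know about the forgotten data.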

Remark. Replay-based methods suit scenarios where partial historical update records are available and moderate storage overhead is acceptable; they provide stable forgetting with small direct model changes but may accumulate approximation error under repeated forgetting. Update-reversal methods are most appropriate when theoretical guarantees are required and sufficient gradient histories are maintained; they offer clear optimization interpretations but struggle to balance correction accuracy and system cost under strong heterogeneity. Generative-reconstruction methods are advantageous when no historical states are stored; they lower system-maintenance cost by avoiding explicit rollback but depend heavily on the quality of surrogate supervision and distillation stability. None alone simultaneously satisfies low storage, strict guarantees, and robustness to heterogeneity.

6. Evaluations and Experimental Settings

Evaluating federated model evolution in open environments requires more than accuracy on static test sets. Models must learn new information, retain prior knowledge, forget specified data, and operate within communication, computation, storage, and privacy constraints. Thus, evaluation should be multidimensional and use experimental settings that reflect dynamic, heterogeneous environments. Below, we summarize key metrics and settings for FDA, FCL, and FU.

6.1. Federated Domain Adaptation

6.1.1. Evaluations

FDA evaluates whether source-domain knowledge can be transferred to target domains under privacy and heterogeneity constraints. Its critical capability requirements include adaptation effect, adaptation speed, and resource consumption.

❶ Adaptation effect. This dimension first assesses task-level utility through metrics matched to the prediction task: target-domain accuracy [41,46] for classification, IoU/Dice [27,38] for dense prediction, unseen-domain accuracy [36,38,47] for domain generalization, and worst-case target risk [44] for robustness under distributional uncertainty. These task metrics are complemented by alignment diagnostics that reveal how well distributions are matched: MMD [39,41] quantifies kernel mean discrepancy, Wasserstein distance [44] captures geometric mismatch, domain-discriminator accuracy [45,46] tests whether domains remain distinguishable, and maximum classifier discrepancy [48,49] reflects decision-boundary disagreement on target samples.

❷ Adaptation speed. This dimension measures how quickly a target client reaches usable performance. Rounds to target accuracy [63,64] and time to threshold [63] provide deployment-oriented stopping criteria. Convergence slope or area under the accuracy–round curve [63,64] captures early-stage learning quality beyond final accuracy alone. For personalized adaptation, local steps to personalization [64] offers finer granularity, while one-shot/few-shot target accuracy [25] is appropriate when target supervision or communication is extremely limited.

❸ Resource consumption. This dimension quantifies what must be exchanged, computed, or stored to enable domain adaptation. Communication rounds [46,63] capture synchronization cost, while the type of transmitted information—parameters/logits [65], BN statistics or source summaries [27], or style/frequency statistics [36,38]—distinguishes different privacy-preserving carriers of domain knowledge. Data-free or one-shot protocols should additionally report generated pseudo-source samples [25,66], local computation/FLOPs [63,64], and generator or proxy-data memory [25,66], as these methods trade raw-data access for synthesis and storage overhead.
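Some of the diagnostics above are simple to compute once features and accuracy curves are logged. The sketch below shows a biased estimate of squared MMD with an RBF kernel, rounds to target accuracy, and a normalized area under the accuracy-round curve. The kernel bandwidth `gamma`, the trapezoidal normalization, and all function names are illustrative assumptions; individual papers may define these quantities differently.

```python
import math
from typing import List, Optional, Sequence

Vector = Sequence[float]


def rbf(x: Vector, y: Vector, gamma: float) -> float:
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(x, y)))


def mmd2(X: List[Vector], Y: List[Vector], gamma: float = 1.0) -> float:
    """Biased estimate of squared MMD between two feature samples."""
    kxx = sum(rbf(a, b, gamma) for a in X for b in X) / (len(X) ** 2)
    kyy = sum(rbf(a, b, gamma) for a in Y for b in Y) / (len(Y) ** 2)
    kxy = sum(rbf(a, b, gamma) for a in X for b in Y) / (len(X) * len(Y))
    return kxx + kyy - 2.0 * kxy


def rounds_to_target(acc_per_round: List[float], threshold: float) -> Optional[int]:
    """First round (1-indexed) whose accuracy reaches the threshold, else None."""
    for r, acc in enumerate(acc_per_round, start=1):
        if acc >= threshold:
            return r
    return None


def normalized_auc(acc_per_round: List[float]) -> float:
    """Trapezoidal area under the accuracy-round curve, divided by its length."""
    if len(acc_per_round) < 2:
        raise ValueError("need at least two rounds")
    area = sum((a + b) / 2.0 for a, b in zip(acc_per_round, acc_per_round[1:]))
    return area / (len(acc_per_round) - 1)


source = [[0.0, 0.0], [1.0, 0.0]]
target = [[5.0, 5.0], [6.0, 5.0]]
curve = [0.2, 0.5, 0.7, 0.8]
```

With these inputs, `mmd2(source, target)` is positive while `mmd2(source, source)` is zero, and `rounds_to_target(curve, 0.7)` returns 3.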

6.1.2. Experimental Settings

FDA protocols should specify source-target construction, target supervision, source-data accessibility, client/domain topology, label/feature-space relation, and adaptation budget. These elements determine whether the task is source-free, unsupervised, few-shot, homogeneous, heterogeneous, or communication-limited. They should be reported separately from the method taxonomy, because data alignment, feature alignment, decoupling, and strategy optimization can all be evaluated under several of these protocols.

❶ Source-target construction and topology. The protocol should state whether the setting involves single or multiple sources and single or multiple targets. For example, FADA [46] represents multi-source adaptation with distributed feature alignment. In contrast, FedDG [38], CCST [36], and FedADG [47] generalize to entirely unseen domains.

❷ Target supervision and source accessibility. The protocol should specify whether target labels are absent, few-shot, semi-supervised, or fully available, and separately whether source data, pretrained models, BN statistics, prototypes, or generated pseudo-sources are accessible. Distinguishing these factors is essential for fair comparison. For instance, BN-statistic source-free segmentation [27] and diffusion-based one-shot adaptation [25] operate under fundamentally different assumptions.

❸ Homogeneity, heterogeneity, and budget. The protocol should state whether clients share feature spaces, label spaces, and model architectures. When models or modalities are heterogeneous, parameter averaging is no longer a valid baseline and distillation-based approaches become necessary [65]. Budgets, including local epochs, participation ratio, communication rounds, transmitted objects, and target-sample counts, must be fixed, since speed and resource metrics are otherwise incomparable across methods. Intermediate or synthetic domains, such as FedDG’s continuous frequency-space interpolation [38], should be treated as protocol details rather than method categories.

6.2. Federated Continual Learning

6.2.1. Evaluations

FCL evaluates whether a federated model can learn over sequential classes, domains, or tasks while remaining useful to old and new clients. Its core capabilities are plasticity, stability, and resource consumption.

❶ Plasticity. Plasticity measures the model’s ability to absorb new knowledge at each increment. At the stage level, metrics like current-task/new-class accuracy [90,91,100,101] verify whether the latest increment is genuinely learned rather than merely shielded by anti-forgetting mechanisms. At the stream level, average incremental accuracy [90,100,101] aggregates performance across all stages, avoiding overemphasis on the final checkpoint alone. When task identity is unavailable at inference, final all-seen-class accuracy [73,117] assesses whether new and old classes can coexist within a unified decision space. For prompt-based or parameter-efficient FCL, task-specific prompt performance [117,118] provides additional insight, as the learnable capacity is deliberately constrained.

❷ Stability. Stability assesses whether historical knowledge remains intact after new increments. The most direct temporal metrics are forgetting rate [79,80,90,100,101], which compares a task’s peak historical performance with its subsequent degradation, and backward transfer (BWT) [79,80], which quantifies whether later learning stages benefit or harm earlier ones. For more interpretable reporting, old-task/old-class accuracy [72,73,76] and average accuracy over all seen tasks [115,116] offer direct retention measures. Because forgetting in federated settings may be unevenly distributed across clients, accuracy degradation under non-IID conditions [72,76] should also be reported to expose client-specific drift that global averages may conceal.

❸ Resource consumption. Resource metrics should be evaluated under the same incremental schedule to ensure comparability. Communication rounds [73,90] measure synchronization overhead, while transmitted bytes or uploaded objects [94,101] distinguish full-model exchange from lighter alternatives such as prototypes, gradients, or summaries. Client-side feasibility is captured by local training time/FLOPs [73,125]. Memory-related costs should be disaggregated into memory-buffer size [94], generator storage/training cost [101], and parameter or module growth [115,116,125], as each mechanism trades a different hidden budget for improved stability.
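The stability metrics above are usually computed from an accuracy matrix. The sketch below assumes `R[i][j]` is the accuracy on task j measured after training stage i (with at least two stages) and uses commonly adopted definitions of average accuracy, backward transfer, and forgetting; the cited papers may use variants, and the function names are hypothetical.

```python
from typing import List

Matrix = List[List[float]]  # R[i][j]: accuracy on task j after stage i


def average_accuracy(R: Matrix) -> float:
    """Mean accuracy over all seen tasks after the final stage."""
    T = len(R)
    return sum(R[T - 1][:T]) / T


def backward_transfer(R: Matrix) -> float:
    """Mean change on earlier tasks between just-learned and final accuracy."""
    T = len(R)
    return sum(R[T - 1][j] - R[j][j] for j in range(T - 1)) / (T - 1)


def forgetting(R: Matrix) -> float:
    """Mean gap between each earlier task's peak and its final accuracy."""
    T = len(R)
    gaps = [max(R[i][j] for i in range(j, T - 1)) - R[T - 1][j] for j in range(T - 1)]
    return sum(gaps) / (T - 1)


R = [
    [0.90, 0.00, 0.00],
    [0.95, 0.80, 0.00],
    [0.60, 0.75, 0.85],
]
```

In this example task 0 peaks at 0.95 after stage 1, so forgetting (0.20) is larger than the magnitude of backward transfer (-0.175); reporting both separates loss relative to the peak from loss relative to the just-learned state.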

6.2.2. Experimental Settings

Existing studies commonly consider class-incremental, domain-incremental, and task-incremental settings, of which the class-incremental setting is the most widely used.

❶ Class increment. The label space expands and inference usually lacks task identity. Datasets such as CIFAR-100 or ImageNet-100 are split into class groups and distributed across IID or non-IID clients; each stage should be evaluated on all seen classes. GLFC [90], CFeD [91], FedCIL [100], and MFCL [101] use such protocols to test new-class learning and old-class boundary retention.

❷ Domain increment. The label space is shared but the input distribution shifts. Protocols should define domain order, client-specific domain sequences, and whether domain identities are known. FedWeIT [116] evaluates sequential datasets, while SR-FDIL [96] studies replay selection under heterogeneous domains, making this setting suitable for long-term drift.

❸ Task increment. Each stage is a distinct task, so protocols must state whether task IDs are available at inference. This distinction is important because known-task inference permits task-specific heads or prompts, while unknown-task inference becomes close to class-incremental learning. Multi-head or expansion methods [70,119] fit known-task settings, whereas prompt methods such as FedMGP [117] and Fed-CPrompt [118] evaluate lightweight task-specific parameters.
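The class-incremental protocol described above can be sketched as a split of the label space into stages, with evaluation on all classes seen so far. The stage size, the seeded shuffling, and the helper names below are illustrative assumptions; the distribution of each stage's data across IID or non-IID clients is left out.

```python
import random
from typing import List, Tuple


def make_class_stages(num_classes: int, classes_per_stage: int, seed: int = 0) -> List[List[int]]:
    """Shuffle the label space once and cut it into consecutive class groups."""
    order = list(range(num_classes))
    random.Random(seed).shuffle(order)
    return [order[i:i + classes_per_stage] for i in range(0, num_classes, classes_per_stage)]


def seen_classes(stages: List[List[int]], stage: int) -> List[int]:
    """All classes introduced up to and including `stage` (0-indexed)."""
    return sorted(c for group in stages[: stage + 1] for c in group)


def eval_subset(test_set: List[Tuple[int, int]], stages: List[List[int]], stage: int) -> List[Tuple[int, int]]:
    """Keep test samples (sample_id, label) whose label has been seen so far."""
    allowed = set(seen_classes(stages, stage))
    return [(x, y) for x, y in test_set if y in allowed]


stages = make_class_stages(num_classes=10, classes_per_stage=4, seed=0)
test_set = [(i, i % 10) for i in range(20)]  # two test samples per class
stage0_eval = eval_subset(test_set, stages, stage=0)
```

Because the evaluation set grows with every stage and no task identity is passed to the model, accuracy on `eval_subset` reflects both new-class learning and old-class retention.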

6.3. Federated Unlearning

6.3.1. Evaluations

FU evaluates whether specified knowledge can be removed without unnecessarily damaging retained knowledge. Its core capabilities include knowledge deletion, knowledge retention, safety, and resource consumption.

❶ Knowledge deletion ability. This dimension verifies whether the targeted contribution has been effectively removed. Output-level metrics directly probe model behavior on deleted data: forget-set accuracy/loss [131,155,168] evaluates predictions on samples or clients requested for deletion, target-class accuracy [148,172,175] applies to class-level unlearning, and target confidence/entropy [154,167] reveals whether the model still assigns confident predictions to forgotten targets. However, degraded target performance may result from global model damage rather than selective deletion. Reference-based metrics address this by comparing the unlearned model against a retrained-from-scratch baseline via parameter distance [130,155], prediction/logit discrepancy [131,139], prediction-distribution divergence [139], and trajectory similarity [140].

❷ Knowledge retention ability. Retention ability measures whether non-forgotten knowledge remains useful. Overall utility is measured by retained-set/global-test accuracy [129,131,155]. Collateral cost is quantified through utility drop from the original model [139,168] and utility gap to the retrained reference [128,130]. In FL, worst-client accuracy or client-wise variance [131,168] exposes localized degradation hidden by global averages. For class-level deletion, retained-class accuracy [148,172,175] checks whether semantically related classes are unintentionally damaged.

❸ Safety. This dimension examines whether unlearning is verifiable and resistant to adversarial probes. For auditability, verification success rate [140] and false acceptance/rejection rate [140] assess whether deletion evidence is reliable. For malicious-target scenarios, backdoor attack success rate [152,154,176] and trigger accuracy [152,176] indicate whether poisoned behavior has been eliminated. For privacy-driven requests, membership-inference attack success [147] and reconstruction/inversion risk [177] test whether deleted information remains extractable.

❹ Resource consumption. Resource consumption measures whether deletion can be performed after deployment at acceptable cost. Response time is captured by unlearning latency [155,168] and extra communication rounds [139,155]. Computation and communication costs are reflected by local retraining epochs [128,129], retained-client participation ratio [131], and communication bytes [129,139]. Storage should be reported separately as checkpoint or cached-update memory [144,155] and correction/adapter memory [139,154], since fast unlearning often shifts cost from retraining to historical-state maintenance.
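A few of the unlearning metrics above reduce to short computations once predictions and parameters are available. The sketch below covers forget-set accuracy, parameter distance to a retrained reference, worst-client accuracy, and backdoor attack success rate. The function names and the flat-list representation of parameters are assumptions made for illustration.

```python
import math
from typing import Dict, Sequence


def accuracy(preds: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of correct predictions; apply to the forget set or retained set."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def l2_distance(w_unlearned: Sequence[float], w_retrained: Sequence[float]) -> float:
    """Parameter distance between the unlearned model and a retrained reference."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(w_unlearned, w_retrained)))


def worst_client_accuracy(per_client_acc: Dict[str, float]) -> float:
    """Lowest per-client accuracy, exposing degradation hidden by the average."""
    return min(per_client_acc.values())


def attack_success_rate(preds_on_triggered: Sequence[int], attacker_label: int) -> float:
    """Fraction of trigger-carrying inputs still classified as the attacker's label."""
    hits = sum(p == attacker_label for p in preds_on_triggered)
    return hits / len(preds_on_triggered)


forget_acc = accuracy([1, 0, 1, 1], [1, 1, 1, 0])
param_gap = l2_distance([0.0, 0.0], [3.0, 4.0])
worst = worst_client_accuracy({"a": 0.9, "b": 0.6})
asr = attack_success_rate([7, 7, 2, 7], attacker_label=7)
```

Low forget-set accuracy alone does not establish selective deletion, so it is best read together with the distance to the retrained reference and the retained-side metrics.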

6.3.2. Experimental Settings

Existing studies mainly consider three granularities of deletion requests: client unlearning, class unlearning, and sample unlearning.

❶ Client unlearning. The system trains on all clients, removes one or a few clients, and compares with a no-target-client reference under IID and non-IID splits. The selected client ratio and whether the target is random or malicious should be explicit. FedEraser [155], Halimi et al. [131], and Wu et al. [129] follow this logic; malicious-client variants also evaluate backdoor removal [154].

❷ Class unlearning. One or more labels are deleted. Protocols should state whether forgotten classes are spread across clients or concentrated locally, and report forgotten-class degradation, retained-class accuracy, and semantic confusion. Class-discriminative pruning [148], class-wise memory generation [172], and class-aware representation transformation [175] fit this setting.

❸ Sample unlearning. Individual local samples are removed. Protocols should state deletion ratio, random or sensitivity-based selection, and whether poisoned or mislabeled samples are included. Since the target signal is small, forget-set accuracy can be noisy; retraining consistency, confidence change, and privacy probes are often more useful. SIFU [167] and FUSED [154] use such fine-grained or malicious-sample settings.

7. Future Research Directions

Although initial progress has been made in federated collaborative model evolution, several promising directions deserve further exploration:

Concept-drift-aware evolution in dynamic non-stationary environments. A core feature of open environments is unpredictable statistical change, namely concept drift. Existing FCL methods often assume that drift is known or abrupt. Future systems must instead automatically detect and robustly handle continuously evolving data streams without prior knowledge. Such systems should be able to distinguish temporary noise from persistent drift and, for recurring drift, retrieve suitable historical model states from memory rather than relearning from scratch.

Co-evolution of attacks and defenses for forgetting. Future work should not only ask how to forget but also how to prevent forgotten knowledge from being recovered. Unlearning inversion attacks are emerging threats, where attackers try to reconstruct deleted data from differences between pre- and post-unlearning models. Future federated evolution frameworks therefore need differential-privacy-aware forgetting mechanisms that statistically bound residual information and make recovery no more likely than random guessing.

Cross-modal collaborative evolution. Given the modality heterogeneity of edge nodes in open environments, future federated evolution must go beyond single-modality settings and support collaborative evolution across videos, time-series signals, text, and other data sources. This calls for methods that project heterogeneous modalities into a unified semantic space for evolution and transfer. The research perspective should also be extended to federated agents that continuously learn and collaborate under dynamic tasks.

Continual learning for intelligent agents. As autonomous agents become widespread in edge computing, IoT, and autonomous driving, static models cannot handle continuously changing task streams and state spaces. Future work should integrate federated continual learning with multi-agent collaborative learning to balance shared knowledge and personalized decision making, adapt collaboratively to dynamic task streams, and preserve long-term memory without exposing private trajectories.

Benchmark platforms for open-environment distributed model evolution. Existing federated learning benchmarks are mostly based on static and closed assumptions and cannot reflect the dynamic inflow and outflow of knowledge in open environments. There is an urgent need for standardized, scalable, and multimodal benchmark platforms that support dynamic node joining and leaving, streaming distribution drift, asynchronous multi-task increments, and compliant forgetting requests, together with unified protocols and multi-dimensional metrics such as plasticity, stability, forgetting correctness, and communication efficiency.

8. Conclusion

This paper reviews continuous federated model evolution in open environments from inflow and outflow perspectives, focusing on three challenges: adapting to new nodes and data distributions, preserving historical performance, and deleting designated knowledge. We summarize federated domain adaptation, continual learning, and unlearning, together with their mechanisms, evaluation metrics, and benchmarks. Overall, these methods must balance plasticity, stability, and compliant deletion. To the best of our knowledge, this is the first survey that systematically reviews the key technologies of model evolution in open federated environments. Despite progress, open problems remain, including concept-drift handling and cross-modal evolution, calling for standardized and scalable frameworks for long-term deployment in IoT and edge intelligence.

References

McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. In Proceedings of the Artificial intelligence and statistics. PMLR, 2017; pp. 1273–1282. [Google Scholar]
Li, Z.; Lin, Z.; Shao, J.; Mao, Y.; Zhang, J. FedCiR: Client-Invariant Representation Learning for Federated Non-IID Features. IEEE TMC 2024, 23, 10509–10522. [Google Scholar] [CrossRef]
Zhong, Z.; Bao, W.; Wang, J.; Chen, J.; Lyu, L.; Lim, W.Y.B. SacFL: Self-Adaptive Federated Continual Learning for Resource-Constrained End Devices. IEEE TNNLS, 2025.
Piao, H.; Wu, Y.; Wu, D.; Wei, Y. Federated continual learning via prompt-based dual knowledge transfer. In Proceedings of the ICML, 2024.
Tan, A.Z.; Feng, S.; Yu, H. Fl-clip: Bridging plasticity and stability in pre-trained federated class-incremental learning models. In Proceedings of the 2024 IEEE International Conference on Multimedia and Expo (ICME); IEEE, 2024; pp. 1–6.
Guo, W.; Zhuang, F.; Zhang, X.; Tong, Y.; Dong, J. A comprehensive survey of federated transfer learning: challenges, methods and applications. Front. Comput. Sci. 2024, 18, 186356.
Yang, X.; Yu, H.; Gao, X.; Wang, H.; Zhang, J.; Li, T. Federated continual learning via knowledge fusion: A survey. IEEE TKDE 2024, 36, 3832–3850.
Liu, Z.; Jiang, Y.; Shen, J.; Peng, M.; Lam, K.Y.; Yuan, X.; Liu, X. A survey on federated unlearning: Challenges, methods, and future directions. ACM Comput. Surv. 2024, 57, 1–38.
Wang, L.; Zhang, X.; Su, H.; Zhu, J. A comprehensive survey of continual learning: Theory, method and application. IEEE TPAMI 2024, 46, 5362–5383.
Zhou, D.W.; Wang, Q.W.; Qi, Z.H.; Ye, H.J.; Zhan, D.C.; Liu, Z. Class-incremental learning: A survey. IEEE TPAMI, 2024.
Shi, H.; Xu, Z.; Wang, H.; Qin, W.; Wang, W.; Wang, Y.; Wang, Z.; Ebrahimi, S.; Wang, H. Continual learning of large language models: A comprehensive survey. ACM Comput. Surv. 2025, 58, 1–42.
Zhou, D.W.; Sun, H.L.; Ning, J.; Ye, H.J.; Zhan, D.C. Continual learning with pre-trained models: a survey. In Proceedings of the IJCAI, 2024; pp. 8363–8371.
Yu, D.; Zhang, X.; Chen, Y.; Liu, A.; Zhang, Y.; Yu, P.S.; King, I. Recent advances of multimodal continual learning: A comprehensive survey. arXiv 2024, arXiv:2410.05352.
Li, Y.; Wang, H.; Xu, W.; Xiao, T.; Liu, H.; Tu, M.; Wang, Y.; Yang, X.; Zhang, R.; Yu, S.; et al. Unleashing the power of continual learning on non-centralized devices: A survey. IEEE Communications Surveys & Tutorials, 2025.
Nguyen, T.T.; Huynh, T.T.; Ren, Z.; Nguyen, P.L.; Liew, A.W.C.; Yin, H.; Nguyen, Q.V.H. A survey of machine unlearning. ACM TIST. 2025, 16, 1–46.
Li, N.; Zhou, C.; Gao, Y.; Chen, H.; Zhang, Z.; Kuang, B.; Fu, A. Machine unlearning: Taxonomy, metrics, applications, challenges, and prospects. IEEE TNNLS, 2025.
Li, J.; Yu, Z.; Du, Z.; Zhu, L.; Shen, H.T. A comprehensive survey on source-free domain adaptation. IEEE TPAMI 2024, 46, 5743–5762.
Li, Y.; Wang, X.; Zeng, R.; Kumar Donta, P.; Murturi, I.; Huang, M.; Dustdar, S. Federated Domain Generalization: A Survey. Proc. IEEE 2025, 113, 370–410.
Fang, Y.; Yap, P.T.; Lin, W.; Zhu, H.; Liu, M. Source-free unsupervised domain adaptation: A survey. Neural Netw. 2024, 174, 106230.
Tang, Z.; Zhang, Y.; Shi, S.; He, X.; Han, B.; Chu, X. Virtual Homogeneity Learning: Defending against Data Heterogeneity in Federated Learning. In Proceedings of the ICML. PMLR, 2022; pp. 21111–21132.
Sariyildiz, M.B.; Cinbis, R.G. Gradient Matching Generative Networks for Zero-Shot Learning. In Proceedings of the CVPR, 2019; pp. 2168–2178.
Kurmi, V.K.; Subramanian, V.K.; Namboodiri, V.P. Domain Impression: A Source Data Free Domain Adaptation Method. In Proceedings of the WACV, 2021; pp. 615–625.
Li, R.; Jiao, Q.; Cao, W.; Wong, H.S.; Wu, S. Model Adaptation: Unsupervised Domain Adaptation Without Source Data. In Proceedings of the CVPR, 2020; pp. 9638–9647.
Yang, M.; Su, S.; Li, B.; Xue, X. Exploring One-Shot Semi-Supervised Federated Learning with Pre-Trained Diffusion Models. Proc. AAAI 2024, Vol. 38, 16325–16333.
Chen, H.; Li, H.; Zhang, Y.; Bi, J.; Zhang, G.; Zhang, Y.; Torr, P.; Gu, J.; Krompass, D.; Tresp, V. FedBiP: Heterogeneous One-Shot Federated Learning with Personalized Latent Diffusion Models. In Proceedings of the CVPR, 2025; pp. 30440–30450.
Wang, G.; Zhu, Y.; Luo, G. DACOA: Diffusion-Aligned Coherent Augmentation and Consistency Constraint Strategies for Federated Domain Generalization. In Pattern Recognition; Antonacopoulos, A., Chaudhuri, S., Chellappa, R., Liu, C.L., Bhattacharya, S., Pal, U., Eds.; Springer Nature Switzerland: Cham, 2025; Vol. 15327, pp. 176–191.
Liu, Y.; Zhang, W.; Wang, J. Source-Free Domain Adaptation for Semantic Segmentation. In Proceedings of the CVPR, 2021; pp. 1215–1224.
Yang, C.; Guo, X.; Chen, Z.; Yuan, Y. Source Free Domain Adaptation for Medical Image Segmentation with Fourier Style Mining. Med. Image Anal. 2022, 79, 102457.
Qiu, Z.; Zhang, Y.; Lin, H.; Niu, S.; Liu, Y.; Du, Q.; Tan, M. Source-Free Domain Adaptation via Avatar Prototype Generation and Adaptation. Proc. IJCAI 2021, Vol. 3, 2921–2927.
Yeh, H.W.; Yang, B.; Yuen, P.C.; Harada, T. SoFA: Source-Data-Free Feature Alignment for Unsupervised Domain Adaptation. In Proceedings of the WACV, 2021; pp. 474–483.
Guo, W.; Zhuang, F.; Zhang, X.; Tong, Y.; Dong, J. A Comprehensive Survey of Federated Transfer Learning: Challenges, Methods and Applications. Front. Comput. Sci. 2024, 18, 186356.
Shin, M.; Hwang, C.; Kim, J.; Park, J.; Bennis, M.; Kim, S.L. XOR Mixup: Privacy-Preserving Data Augmentation for One-Shot Federated Learning. arXiv 2020, arXiv:cs.
Yoon, T.; Shin, S.; Hwang, S.J.; Yang, E. FedMix: Approximation of Mixup under Mean Augmented Federated Learning. In Proceedings of the ICLR. OpenReview.net, 2021.
Yang, S.; Choi, S.; Park, H.; Choi, S.; Yun, S. Client-Agnostic Learning and Zero-Shot Adaptation for Federated Domain Generalization. US Patent US20240112039A1, 2024.
Ding, Y.; Sheng, L.; Liang, J.; Zheng, A.; He, R. ProxyMix: Proxy-based Mixup Training with Label Refinery for Source-Free Domain Adaptation. Neural Netw. 2023, 167, 92–103.
Chen, J.; Jiang, M.; Dou, Q.; Chen, Q. Federated Domain Generalization for Image Recognition via Cross-Client Style Transfer. In Proceedings of the CVPR, 2023; pp. 361–370.
Zhou, K.; Yang, Y.; Qiao, Y.; Xiang, T. Domain Generalization with MixStyle. In Proceedings of the ICLR. OpenReview.net, 2021.
Liu, Q.; Chen, C.; Qin, J.; Dou, Q.; Heng, P.A. FedDG: Federated Domain Generalization on Medical Image Segmentation via Episodic Learning in Continuous Frequency Space. In Proceedings of the CVPR, 2021; pp. 1013–1023.
Gretton, A.; Borgwardt, K.; Rasch, M.; Schölkopf, B.; Smola, A. A Kernel Method for the Two-Sample-Problem. In Proceedings of the NeurIPS, 2006; MIT Press; Vol. 19.
Long, M.; Cao, Y.; Wang, J.; Jordan, M. Learning Transferable Features with Deep Adaptation Networks. In Proceedings of the ICML. PMLR, 2015; pp. 97–105.
Chen, J.; Li, J.; Huang, R.; Yue, K.; Chen, Z.; Li, W. Federated Transfer Learning for Bearing Fault Diagnosis With Discrepancy-Based Weighted Federated Averaging. IEEE Trans. Instrum. Meas. 2022, 71, 1–11.
Peng, X.; Bai, Q.; Xia, X.; Huang, Z.; Saenko, K.; Wang, B. Moment Matching for Multi-Source Domain Adaptation. In Proceedings of the ICCV, 2019; pp. 1406–1415.
Sun, Y.; Chong, N.; Ochiai, H. Feature Distribution Matching for Federated Domain Generalization. In Proceedings of the 14th Asian Conference on Machine Learning. PMLR, 2023; pp. 942–957.
Nguyen, T.A.; Nguyen, T.D.; Le, L.T.; Dinh, C.T.; Tran, N.H. On the Generalization of Wasserstein Robust Federated Learning. arXiv 2022, arXiv:cs.
Ganin, Y.; Ustinova, E.; Ajakan, H.; Germain, P.; Larochelle, H.; Laviolette, F.; Marchand, M.; Lempitsky, V. Domain-Adversarial Training of Neural Networks. J. Mach. Learn. Res. 2016, 17, 1–35.
Peng, X.; Huang, Z.; Zhu, Y.; Saenko, K. Federated Adversarial Domain Adaptation. In Proceedings of the ICLR. OpenReview.net, 2020.
Zhang, L.; Lei, X.; Shi, Y.; Huang, H.; Chen, C. Federated Learning for IoT Devices With Domain Generalization. IEEE IoTJ 2023, 10, 9622–9633.
Saito, K.; Watanabe, K.; Ushiku, Y.; Harada, T. Maximum Classifier Discrepancy for Unsupervised Domain Adaptation. In Proceedings of the CVPR, 2018; pp. 3723–3732.
Xia, H.; Zhao, H.; Ding, Z. Adaptive Adversarial Network for Source-Free Domain Adaptation. In Proceedings of the ICCV, 2021; pp. 9010–9019.
Wu, G.; Gong, S. Collaborative Optimization and Aggregation for Decentralized Domain Generalization and Adaptation. In Proceedings of the ICCV, 2021; pp. 6464–6473.
Luo, Z.; Wang, Y.; Wang, Z.; Sun, Z.; Tan, T. Disentangled Federated Learning for Tackling Attributes Skew via Invariant Aggregation and Diversity Transferring. In Proceedings of the ICML. PMLR, 2022; pp. 14527–14541.
Wang, M.; Yu, K.; Feng, C.M.; Qian, Y.; Zou, K.; Wang, L.; Goh, R.S.M.; Xu, X.; Liu, Y.; Fu, H. Reliable Federated Disentangling Network for Non-IID Domain Feature. IEEE Trans. Big Data 2025, 11, 648–658.
Ma, B.; Yin, X.; Tan, J.; Chen, Y.; Huang, H.; Wang, H.; Xue, W.; Ban, X. FedST: Federated Style Transfer Learning for Non-IID Image Segmentation. Proc. AAAI 2024, Vol. 38, 4053–4061.
Bai, S.; Zhang, J.; Guo, S.; Li, S.; Guo, J.; Hou, J.; Han, T.; Lu, X. DiPrompT: Disentangled Prompt Tuning for Multiple Latent Domain Generalization in Federated Learning. Proc. CVPR 2024, 27274–27283.
Arivazhagan, M.G.; Aggarwal, V.; Singh, A.K.; Choudhary, S. Federated Learning with Personalization Layers. arXiv 2019, arXiv:cs.
Collins, L.; Hassani, H.; Mokhtari, A.; Shakkottai, S. Exploiting Shared Representations for Personalized Federated Learning. In Proceedings of the ICML. PMLR, 2021; pp. 2089–2099.
Oh, J.; Kim, S.; Yun, S.Y. FedBABU: Toward Enhanced Representation for Federated Image Classification. In Proceedings of the ICLR. OpenReview.net, 2022.
Liang, P.P.; Liu, T.; Ziyin, L.; Allen, N.B.; Auerbach, R.P.; Brent, D.; Salakhutdinov, R.; Morency, L.P. Think Locally, Act Globally: Federated Learning with Local and Global Representations. arXiv 2020, arXiv:cs.
Chen, H.Y.; Chao, W.L. On Bridging Generic and Personalized Federated Learning for Image Classification. In Proceedings of the ICLR. OpenReview.net, 2022.
Zhang, R.; Xu, Q.; Yao, J.; Zhang, Y.; Tian, Q.; Wang, Y. Federated Domain Generalization With Generalization Adjustment. In Proceedings of the CVPR, 2023; pp. 3954–3963.
Yuan, J.; Ma, X.; Chen, D.; Wu, F.; Lin, L.; Kuang, K. Collaborative Semantic Aggregation and Calibration for Federated Domain Generalization. IEEE TKDE 2023, 35, 12528–12541.
Chen, Y.; He, N.; Sun, L. FedAWA: Aggregation Weight Adjustment in Federated Domain Generalization. In Proceedings of the ICIP, 2024; pp. 451–457.
Chen, F.; Luo, M.; Dong, Z.; Li, Z.; He, X. Federated Meta-Learning with Fast Convergence and Efficient Communication; 2018.
Fallah, A.; Mokhtari, A.; Ozdaglar, A. Personalized Federated Learning with Theoretical Guarantees: A Model-Agnostic Meta-Learning Approach. Proc. NeurIPS 2020, Vol. 33, 3557–3568.
Li, D.; Wang, J. FedMD: Heterogenous Federated Learning via Model Distillation; 2019.
Zhu, Z.; Hong, J.; Zhou, J. Data-Free Knowledge Distillation for Heterogeneous Federated Learning. In Proceedings of the ICML. PMLR, 2021; pp. 12878–12889.
Zhang, L.; Shen, L.; Ding, L.; Tao, D.; Duan, L.Y. Fine-Tuning Global Model via Data-Free Knowledge Distillation for Non-IID Federated Learning. In Proceedings of the CVPR, 2022; pp. 10174–10183.
Wang, H.; Li, Y.; Xu, W.; Li, R.; Zhan, Y.; Zeng, Z. DaFKD: Domain-aware Federated Knowledge Distillation. In Proceedings of the CVPR, 2023; pp. 20412–20421.
Niu, Z.; Wang, H.; Sun, H.; Ouyang, S.; Chen, Y.w.; Lin, L. MCKD: Mutually Collaborative Knowledge Distillation For Federated Domain Adaptation And Generalization. In Proceedings of the ICASSP, Rhodes Island, Greece, 2023; pp. 1–5.
Venkatesha, Y.; Kim, Y.; Park, H.; Li, Y.; Panda, P. Addressing client drift in federated continual learning with adaptive optimization. Available at SSRN 4188586 2022.
Hamedi, P.; Razavi-Far, R.; Hallaji, E. Federated Continual Learning: Concepts, Challenges, and Solutions. arXiv 2025, arXiv:2502.07059.
Yu, H.; Yang, X.; Zhang, L.; Gu, H.; Li, T.; Fan, L.; Yang, Q. Handling spatial-temporal data heterogeneity for federated continual learning via tail anchor. In Proceedings of the CVPR, 2025; pp. 4874–4883.
He, Y.; Shen, C.; Wang, X.; Jin, B. Fppl: an efficient and non-iid robust federated continual learning framework. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData). IEEE, 2024; pp. 3692–3701.
Psaltis, A.; Chatzikonstantinou, C.; Patrikakis, C.Z.; Daras, P. FedRCIL: Federated Knowledge Distillation for Representation based Contrastive Incremental Learning. In Proceedings of the ICCV. IEEE, 2023; pp. 3455–3464.
Shenaj, D.; Toldo, M.; Rigon, A.; Zanuttigh, P. Asynchronous federated continual learning. In Proceedings of the CVPR, 2023; pp. 5055–5063.
Shoham, N.; Avidor, T.; Keren, A.; Israel, N.; Benditkis, D.; Mor-Yosef, L.; Zeitak, I. Overcoming forgetting in federated learning on non-iid data. arXiv 2019, arXiv:1910.07796.
Yao, X.; Sun, L. Continual local training for better initialization of federated models. In Proceedings of the ICIP. IEEE, 2020; pp. 1736–1740.
Li, Y.; Wang, Y.; Wang, H.; Qi, Y.; Xiao, T.; Li, R. FedSSI: Rehearsal-Free Continual Federated Learning with Synergistic Synaptic Intelligence. In Proceedings of the ICML, 2025.
Moussadek, O.; Salami, R.; Calderara, S. DOLFIN: Balancing Stability and Plasticity in Federated Continual Learning. arXiv 2025, arXiv:2510.13567.
Bakman, Y.F.; Yaldiz, D.N.; Ezzeldin, Y.H.; Avestimehr, S. Federated Orthogonal Training: Mitigating Global Catastrophic Forgetting in Continual Federated Learning. In Proceedings of the ICLR, 2024.
Ke, H.; Shi, J.; Zhang, Y.; Wang, F.; Xie, Y.; Qu, Y. Task-Aware Prompt Gradient Projection for Parameter-Efficient Tuning Federated Class-Incremental Learning. Proc. ICCV 2025, 2631–2641.
Zhang, Y.; Zhu, H.; Tan, A.Z.; Yu, D.; Huang, L.; Yu, H. pFedMxF: Personalized Federated Class-Incremental Learning with Mixture of Frequency Aggregation. Proc. CVPR 2025, 30640–30650.
Zhang, C.; Shang, F.; Liu, H.; Wan, L.; Feng, W. FedAGC: Federated Continual Learning with Asymmetric Gradient Correction. In Proceedings of the ICCV, 2025; pp. 3841–3850.
Nguyen, M.D.; Nguyen, L.T.; Pham, Q.V. Improving Generalization in Heterogeneous Federated Continual Learning via Spatio-Temporal Gradient Matching with Prototypical Coreset. arXiv 2025, arXiv:2506.12031.
Li, Z.; Hoiem, D. Learning without forgetting. IEEE TPAMI 2017, 40, 2935–2947.
Usmanova, A.; Portet, F.; Lalanda, P.; Vega, G. A distillation-based approach integrating continual learning and federated learning for pervasive services. arXiv 2021, arXiv:2109.04197.
Chen, L.; Zhao, D.; Gao, Y.; Zhou, J.; Wei, T.C. FedMTL: Adaptive Multi-Teacher Knowledge Distillation for Federated Continual Learning. KBS 2025, 115160.
Gai, K.; Wang, Z.; Yu, J.; Zhu, L. Mufti: Multi-domain distillation-based heterogeneous federated continuous learning. IEEE TIFS, 2025.
Wu, Z.; He, T.; Sun, S.; Wang, Y.; Liu, M.; Gao, B.; Jiang, X. Federated class-incremental learning with new-class augmented self-distillation. arXiv 2024, arXiv:2401.00622.
Dong, J.; Wang, L.; Fang, Z.; Sun, G.; Xu, S.; Wang, X.; Zhu, Q. Federated class-incremental learning. In Proceedings of the CVPR, 2022; pp. 10164–10173.
Ma, Y.; Xie, Z.; Wang, J.; Chen, K.; Shou, L. Continual Federated Learning Based on Knowledge Distillation. In Proceedings of the IJCAI, 2022; pp. 2182–2188.
Robins, A. Catastrophic forgetting, rehearsal and pseudorehearsal. Connect. Sci. 1995, 7, 123–146.
Rolnick, D.; Ahuja, A.; Schwarz, J.; Lillicrap, T.; Wayne, G. Experience replay for continual learning. NeurIPS 2019, 32.
Good, J.; Majmudar, J.; Dupuy, C.; Wang, J.; Peris, C.; Chung, C.; Zemel, R.; Gupta, R. Coordinated replay sample selection for continual federated learning. arXiv 2023, arXiv:2310.15054.
Li, Y.; Li, Q.; Wang, H.; Li, R.; Zhong, W.; Zhang, G. Towards efficient replay in federated incremental learning. In Proceedings of the CVPR, 2024; pp. 12820–12829.
Li, Y.; Xu, W.; Qi, Y.; Wang, H.; Li, R.; Guo, S. Sr-fdil: Synergistic replay for federated domain-incremental learning. IEEE TPDS 2024, 35, 1879–1890.
Serra, G.; Buettner, F. Federated Continual Learning Goes Online: Uncertainty-Aware Memory Management for Vision Tasks and Beyond. In Proceedings of the ICLR, 2025.
Wang, Z.; Zhang, Y.; Xu, X.; Fu, Z.; Yang, H.; Du, W. Federated probability memory recall for federated continual learning. Inf. Sci. 2023, 629, 551–565.
Rasouli, M.; Sun, T.; Rajagopal, R. Fedgan: Federated generative adversarial networks for distributed data. arXiv 2020, arXiv:2006.07228.
Qi, D.; Zhao, H.; Li, S. Better generative replay for continual federated learning. In Proceedings of the ICLR, 2023.
Babakniya, S.; Fabian, Z.; He, C.; Soltanolkotabi, M.; Avestimehr, S. A data-free approach to mitigate catastrophic forgetting in federated class incremental learning for vision tasks. NeurIPS 2023, 36, 66408–66425.
Zhang, J.; Chen, C.; Zhuang, W.; Lyu, L. Target: Federated class-continual learning via exemplar-free distillation. In Proceedings of the ICCV, 2023; pp. 4782–4793.
Miao, H.; Zhao, Y.; Guo, C.; Yang, B.; Zheng, K.; Jensen, C.S. Spatio-temporal prediction on streaming data: A unified federated continuous learning framework. IEEE TKDE, 2025.
Zhu, Y.; Hu, M.; Wu, D. Federated continual graph learning. Proc. ACM SIGKDD 2025, 4203–4213.
Gao, X.; Yang, X.; Yu, H.; Kang, Y.; Li, T. Fedprok: Trustworthy federated class-incremental learning via prototypical feature knowledge transfer. In Proceedings of the CVPR, 2024; pp. 4205–4214.
Yoo, M.K.; Park, Y.R. Federated class incremental learning: A pseudo feature based approach without exemplars. In Proceedings of the Asian Conference on Computer Vision, 2024; pp. 488–498.
Salami, R.; Buzzega, P.; Mosconi, M.; Verasani, M.; Calderara, S. Federated class-incremental learning with hierarchical generative prototypes. arXiv 2024, arXiv:2406.02447.
Liang, J.; Zhong, J.; Gu, H.; Lu, Z.; Tang, X.; Dai, G.; Huang, S.; Fan, L.; Yang, Q. Diffusion-driven data replay: A novel approach to combat forgetting in federated class continual learning. In Proceedings of the ECCV. Springer, 2024; pp. 303–319.
Mei, Y.; Yuan, L.; Han, D.J.; Chan, K.S.; Brinton, C.G.; Lan, T. Using Diffusion Models as Generative Replay in Continual Federated Learning–What will Happen? arXiv 2024, arXiv:2411.06618.
Zhang, X.; Chen, Z.; Yuan, Y.; Zou, Y.; Zhuang, F.; Jiao, W.; Wang, Y.; Yu, D. Data-Free Continual Learning of Server Models in Model-Heterogeneous Federated learning. arXiv 2025, arXiv:2509.25977.
Wuerkaixi, A.; Cui, S.; Zhang, J.; Yan, K.; Han, B.; Niu, G.; Fang, L.; Zhang, C.; Sugiyama, M. Accurate forgetting for heterogeneous federated continual learning. arXiv 2025, arXiv:2502.14205.
Rong, X.; Zhang, J.; He, K.; Ye, M. CAN: Leveraging Clients As Navigators for Generative Replay in Federated Continual Learning. In Proceedings of the ICML, 2025.
Mallya, A.; Lazebnik, S. Packnet: Adding multiple tasks to a single network by iterative pruning. In Proceedings of the CVPR, 2018; pp. 7765–7773.
Wang, Q.; Liu, B.; Li, Y. Traceable federated continual learning. In Proceedings of the CVPR, 2024; pp. 12872–12881.
Wang, H.; Sun, J.; Wo, T.; Liu, X. FedFRR: Federated Forgetting-Resistant Representation Learning. In Proceedings of the ICME. IEEE, 2024; pp. 1–6.
Yoon, J.; Jeong, W.; Lee, G.; Yang, E.; Hwang, S.J. Federated continual learning with weighted inter-client transfer. In Proceedings of the ICML. PMLR, 2021; pp. 12073–12086.
Yu, H.; Yang, X.; Gao, X.; Kang, Y.; Wang, H.; Zhang, J.; Li, T. Personalized federated continual learning via multi-granularity prompt. In Proceedings of the ACM SIGKDD, 2024; pp. 4023–4034.
Bagwe, G.; Yuan, X.; Pan, M.; Zhang, L. Fed-CPrompt: Contrastive Prompt for Rehearsal-Free Federated Continual Learning. In Proceedings of the Federated Learning and Analytics in Practice: Algorithms, Systems, Applications, and Opportunities, 2023.
Chen, C.; Wang, K.I.K.; Li, P.; Sakurai, K. Flexibility and privacy: A multi-head federated continual learning framework for dynamic edge environments. In Proceedings of the CANDAR. IEEE, 2023; pp. 1–10.
Yu, H.; Yang, X.; Gao, X.; Feng, Y.; Wang, H.; Kang, Y.; Li, T. Overcoming spatial-temporal catastrophic forgetting for federated class-incremental learning. In Proceedings of the ACM MM, 2024; pp. 5280–5288.
Qi, X.; Zhang, J.; Fu, H.; Yang, G.; Li, S.; Jin, Y. Dynamic Allocation Hypernetwork with Adaptive Model Recalibration for Federated Continual Learning. In Proceedings of the International Conference on Information Processing in Medical Imaging, 2025; Springer; pp. 342–356.
Thwal, C.M.; Tun, Y.L.; Kim, K.; Park, S.B.; Hong, C.S. Transformers with attentive federated aggregation for time series stock forecasting. In Proceedings of the ICOIN. IEEE, 2023; pp. 499–504.
Jiang, X.; Borcea, C. Concept matching: clustering-based federated continual learning. arXiv 2023, arXiv:2311.06921.
Casado, F.E.; Lema, D.; Iglesias, R.; Regueiro, C.V.; Barro, S. Federated and continual learning for classification tasks in a society of devices. arXiv 2020, arXiv:2006.07129.
Salami, R.; Buzzega, P.; Mosconi, M.; Bonato, J.; Sabetta, L.; Calderara, S. Closed-Form Merging of Parameter-Efficient Modules for Federated Continual Learning. In Proceedings of the ICLR, 2025.
Tang, J.; Zhuang, H.; He, J.; He, R.; Wang, J.; Fan, K.; Liu, A.; Wang, T.; Wang, L.; Zhu, Z.; et al. AFCL: Analytic Federated Continual Learning for Spatio-Temporal Invariance of Non-IID Data. arXiv 2025, arXiv:2505.12245.
Yuan, L.; Ma, Y.; Su, L.; Wang, Z. Peer-to-peer federated continual learning for naturalistic driving action recognition. In Proceedings of the CVPR, 2023; pp. 5250–5259.
Bourtoule, L.; Chandrasekaran, V.; Choquette-Choo, C.A.; Jia, H.; Travers, A.; Zhang, B.; Lie, D.; Papernot, N. Machine unlearning. In Proceedings of the 2021 IEEE symposium on security and privacy (SP); IEEE, 2021; pp. 141–159.
Wu, L.; Guo, S.; Wang, J.; Hong, Z.; Zhang, J.; Ding, Y. Federated unlearning: Guarantee the right of clients to forget. IEEE Netw. 2022, 36, 129–135.
Liu, Y.; Xu, L.; Yuan, X.; Wang, C.; Li, B. The right to be forgotten in federated learning: An efficient realization with rapid retraining. In Proceedings of the IEEE INFOCOM. IEEE, 2022; pp. 1749–1758.
Halimi, A.; Kadhe, S.R.; Rawat, A.; Angel, N.B. Federated Unlearning: How to Efficiently Erase a Client in FL? In Proceedings of the ICML, 2022.
Zhang, Z.Y.; Nhung, B.T.C.; Verma, A.; Ding, B.; Low, B.K.H. Achieving Exact Federated Unlearning with Improved Post-Unlearning Performance. 2025.
Pan, C.; Sima, J.; Prakash, S.; Rana, V.; Milenkovic, O. Machine unlearning of federated clusters. arXiv 2022, arXiv:2210.16424.
Liu, Z.; Jiang, Y.; Jiang, W.; Guo, J.; Zhao, J.; Lam, K.Y. Guaranteeing data privacy in federated unlearning with dynamic user participation. IEEE TDSC, 2024.
Su, N.; Li, B. Asynchronous federated unlearning. In Proceedings of the IEEE INFOCOM. IEEE, 2023; pp. 1–10.
Lin, Y.; Gao, Z.; Du, H.; Niyato, D.; Gui, G.; Cui, S.; Ren, J. Scalable Federated Unlearning via Isolated and Coded Sharding. In Proceedings of the IJCAI, 2024.
Wang, Z.; Gao, X.; Wang, C.; Cheng, P.; Chen, J. Efficient vertical federated unlearning via fast retraining. ACM Trans. Internet Technol. 2024, 24, 1–22.
Zhao, Y.; Wang, P.; Qi, H.; Huang, J.; Wei, Z.; Zhang, Q. Federated unlearning with momentum degradation. IEEE IoTJ 2023, 11, 8860–8870.
Zhang, L.; Zhu, T.; Zhang, H.; Xiong, P.; Zhou, W. Fedrecovery: Differentially private machine unlearning for federated learning frameworks. IEEE TIFS 2023, 18, 4732–4746.
Gao, X.; Ma, X.; Wang, J.; Sun, Y.; Li, B.; Ji, S.; Cheng, P.; Chen, J. Verifi: Towards verifiable federated unlearning. IEEE TDSC 2024, 21, 5720–5736.
Che, T.; Zhou, Y.; Zhang, Z.; Lyu, L.; Liu, J.; Yan, D.; Dou, D.; Huan, J. Fast federated machine unlearning with nonlinear functional theory. In Proceedings of the ICML. PMLR, 2023; pp. 4241–4268.
Chundawat, V.S.; Niroula, P.; Dhungana, P.; Schoepf, S.; Mandal, M.; Brintrup, A. Conda: Fast federated unlearning with contribution dampening. arXiv 2024, arXiv:2410.04144.
Li, G.; Shen, L.; Sun, Y.; Hu, Y.; Hu, H.; Tao, D. Subspace based federated unlearning. arXiv 2023, arXiv:2302.12448.
Lang, N.; Helvitz, A.; Shlezinger, N. Memory-Efficient Distributed Unlearning. arXiv 2025, arXiv:2505.03388.
Gu, H.; Ong, W.; Chan, C.S.; Fan, L. Ferrari: federated feature unlearning via optimizing feature sensitivity. NeurIPS 2024, 37, 24150–24180.
Leng, Y.; Xu, L.; Liu, J.; Zhang, X.; Mei, L.; Qu, Y.; Xu, C. FedSSU: flexible and efficient decentralized unlearning for federated learning. J. Supercomput. 2025, 81, 986.
Liu, Z.; Ye, H.; Jiang, Y.; Shen, J.; Guo, J.; Tjuawinata, I.; Lam, K.Y. Privacy-preserving federated unlearning with certified client removal. IEEE TIFS, 2025.
Wang, J.; Guo, S.; Xie, X.; Qi, H. Federated unlearning via class-discriminative pruning. In Proceedings of the WWW, 2022; pp. 622–632.
Xia, H.; Xu, S.; Pei, J.; Zhang, R.; Yu, Z.; Zou, W.; Wang, L.; Liu, C. FedME²: Memory evaluation & erase promoting federated unlearning in DTMN. IEEE J. Sel. Areas Commun. 2023, 41, 3573–3588.
Zhu, X.; Li, G.; Hu, W. Heterogeneous federated knowledge graph embedding learning and unlearning. In Proceedings of the WWW, 2023; pp. 2444–2454.
Pan, Z.; Ying, Z.; Wang, Y.; Zhang, C.; Zhang, W.; Zhou, W.; Zhu, L. Feature-based machine unlearning for vertical federated learning in iot networks. IEEE TMC, 2025.
Han, M.; Zhu, T.; Zhang, L.; Huo, H.; Zhou, W. Vertical federated unlearning via backdoor certification. IEEE TSC, 2025.
Zhao, S.; Zhang, J.; Ma, X.; Jiang, Q.; Ma, Z.; Gao, S.; Ying, Z.; Ma, J. FedWiper: Federated Unlearning via Universal Adapter. IEEE TIFS, 2025.
Zhong, Z.; Bao, W.; Wang, J.; Zhang, S.; Zhou, J.; Lyu, L.; Lim, W.Y.B. Unlearning through knowledge overwriting: Reversible federated unlearning via selective sparse adapter. In Proceedings of the CVPR, 2025; pp. 30661–30670.
Liu, G.; Ma, X.; Yang, Y.; Wang, C.; Liu, J. Federaser: Enabling efficient client-level data removal from federated learning models. In Proceedings of the IWQOS. IEEE, 2021; pp. 1–10.
Guo, X.; Wang, P.; Qiu, S.; Song, W.; Zhang, Q.; Wei, X.; Zhou, D. Fast: Adopting federated unlearning to eliminating malicious terminals at server side. IEEE Trans. Netw. Sci. Eng. 2023, 11, 2289–2302.
Dhasade, A.; Ding, Y.; Guo, S.; Kermarrec, A.m.; De Vos, M.; Wu, L. Quickdrop: Efficient federated unlearning by integrated dataset distillation. arXiv 2023, arXiv:2311.15603.
Ameen, M.; Wang, P.; Su, W.; Wei, X.; Zhang, Q. Speed up federated unlearning with temporary local models. IEEE Transactions on Sustainable Computing, 2025.
Yuan, W.; Yin, H.; Wu, F.; Zhang, S.; He, T.; Wang, H. Federated unlearning for on-device recommendation. In Proceedings of the Sixteenth ACM International Conference on Web Search and Data Mining, 2023; pp. 393–401.
Mora, A.; Dominici, L.; Bellavista, P. Fedunran: On-device federated unlearning via random labels. In Proceedings of the 2024 IEEE International Conference on Big Data (BigData). IEEE, 2024; pp. 7955–7960.
Zhang, J.; Zhao, M.; Wang, Z.; Su, W.; Wang, P. Model recovery in federated unlearning with restricted server data resources. IEEE IoTJ, 2025.
Jin, R.; Chen, M.; Zhang, Q.; Li, X. Forgettable federated linear learning with certified data removal. arXiv 2023, arXiv–2306.
Xiong, Z.; Li, W.; Li, Y.; Cai, Z. Exact-fun: an exact and efficient federated unlearning approach. In Proceedings of the ICDM. IEEE, 2023; pp. 1439–1444.
Wang, W.; Tian, Z.; Zhang, C.; Liu, A.; Yu, S. Bfu: Bayesian federated unlearning with parameter self-sharing. In Proceedings of the 2023 ACM Asia Conference on Computer and Communications Security, 2023; pp. 567–578.
Tao, Y.; Wang, C.L.; Pan, M.; Yu, D.; Cheng, X.; Wang, D. Communication Efficient and Provable Federated Unlearning. CoRR 2024.
Wang, W.; Zhang, C.; Tian, Z.; Yu, S. Fedu: Federated unlearning via user-side influence approximation forgetting. IEEE TDSC, 2024.
Fraboni, Y.; Van Waerebeke, M.; Scaman, K.; Vidal, R.; Kameni, L.; Lorenzi, M. Sifu: Sequential informed federated unlearning for efficient and provable client unlearning in federated optimization. In Proceedings of the International Conference on Artificial Intelligence and Statistics. PMLR, 2024; pp. 3457–3465.
Huynh, T.T.; Nguyen, T.B.; Nguyen, P.L.; Nguyen, T.T.; Weidlich, M.; Nguyen, Q.V.H.; Aberer, K. Fast-fedul: A training-free federated unlearning with provable skew resilience. In Proceedings of the Joint European Conference on Machine Learning and Knowledge Discovery in Databases, 2024; Springer; pp. 55–72.
Huynh, T.T.; Nguyen, T.B.; Nguyen, T.T.; Nguyen, P.L.; Yin, H.; Nguyen, Q.V.H.; Nguyen, T.T. Certified unlearning for federated recommendation. ACM Trans. Inf. Syst. 2025, 43, 1–29.
Li, L.; Hu, L.; Mo, K.; Ding, Z.; Wu, Y.; Yan, H.; Li, J. Inverse correction-optimized vertical federated unlearning. J. Supercomput. 2025, 81, 845.
Li, Y.; Chen, C.; Zheng, X.; Zhang, J. Federated unlearning via active forgetting. arXiv 2023, arXiv:2307.03363.
Li, Y.; Zhang, J.; Liu, Y.; Chen, C. Class-wise federated unlearning: Harnessing active forgetting with teacher–student memory generation. KBS 2025, 316, 113353.
Wang, F.; Huo, J.; Wang, W.; Zhang, X.; Liu, Y.; Tan, Z.; Wang, C. FedBT: Effective and Robust Federated Unlearning via Bad Teacher Distillation for Secure Internet of Things. IEEE IoTJ 2025, 30634–30648.
Zheng, J.; Li, K.; Zhou, C.; Zhu, D.; Pan, C.; Du, X. Redundancy-Aware Federated Unlearning with Reverse and Selective Distillation. Submitted to the International Conference on Machine Intelligence Theory and Applications (under review), 2025.
Guo, Q.; Tian, Z.; Yao, M.; Qi, S.; Qi, Y.; Liu, B. Forgetting through transforming: Enabling federated unlearning via class-aware representation transformation. In Proceedings of the ICCV, 2025; pp. 1474–1483.
Daluwatta, W.; Khalil, I.; Edirimannage, S.; Atiquzzaman, M. UaaS-SFL: Unlearning as a Service for Safeguarding Federated Learning. IEEE Trans. Netw. Serv. Manag. 2025, 1029–1045.
Ghannam, N.E.; Mahareek, E.A. Privacy-Preserving Federated Unlearning with Ontology-Guided Relevance Modeling for Secure Distributed Systems. Future Internet 2025, 17, 335.

Figure 1. Illustration of federated learning model evolution in open environments.

Figure 2. Interdependencies among federated model evolution problems.

Figure 3. Taxonomy of federated domain adaptation methods.

Figure 4. Taxonomy of federated continual learning methods.

Figure 5. Taxonomy of federated unlearning methods.

Table 1. Comparison of survey papers related to federated collaborative evolution in open environments.

Paper	Category	Year	Field	Contribution
[9]	Centralized	2024	Continual Learning	Systematically summarizes the theoretical foundations of CL and proposes a five-category taxonomy based on the stability–plasticity trade-off.
[10]	Centralized	2024	Continual Learning	Provides a dedicated survey on class-incremental learning and evaluates the fairness and efficiency of 17 algorithms under a unified framework.
[11]	Centralized	2025	Continual Learning	Proposes a new taxonomy of vertical and horizontal continuity for LLMs and discusses continual learning challenges in the era of large models.
[12]	Centralized	2024	Continual Learning	Investigates how strong representations from pretrained models can mitigate catastrophic forgetting and improve transfer efficiency.
[7]	Non-centralized	2024	Continual Learning	Focuses on FCL, introduces the concept of spatio-temporal catastrophic forgetting, and summarizes seven knowledge-fusion frameworks.
[13]	Centralized	2024	Continual Learning	Presents the first survey on multimodal continual learning and categorizes knowledge retention challenges under modality imbalance and complex interactions.
[14]	Non-centralized	2025	Continual Learning	Focuses on algorithm deployment in decentralized environments, emphasizing real-time stream processing on heterogeneous devices.
[15]	Centralized	2025	Machine Unlearning	Systematically defines the framework of machine unlearning and the types of removal requests, covering both exact and approximate unlearning algorithms.
[8]	Non-centralized	2024	Machine Unlearning	Reviews the handling of unlearning requests in FL and summarizes challenges such as knowledge entanglement under federated architectures.
[16]	Centralized	2025	Machine Unlearning	Proposes a fine-grained taxonomy of unlearning algorithms and discusses validation metrics as well as their extension to distributed settings.
[17]	Centralized	2024	Domain Adaptation	Systematically reviews domain adaptation techniques when source data are inaccessible and reveals their internal mechanisms through modular analysis.
[18]	Non-centralized	2025	Domain Adaptation	Presents the first comprehensive survey of federated domain generalization and investigates how to generalize to unseen domains in distributed settings.
[6]	Non-centralized	2024	Domain Adaptation	Thoroughly analyzes federated transfer learning strategies for addressing system heterogeneity, data increment, and label scarcity.
[19]	Centralized	2024	Domain Adaptation	Focuses on source-free unsupervised settings and summarizes knowledge transfer techniques from black-box models to unlabeled target domains.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.

Copyright: This open access article is published under a Creative Commons CC BY 4.0 license, which permits free download, distribution, and reuse, provided that the author and preprint are cited in any reuse.
