Submitted: 02 September 2024
Posted: 04 September 2024
Abstract
Keywords:
1. Introduction
2. Related work
- SL and FL: The reference [5] introduces a personalized SL framework to address issues such as data leakage and non-i.i.d. datasets in decentralized learning. It proposes an optimal cut-layer selection method based on multiplayer bargaining and the Kalai-Smorodinsky bargaining solution (KSBS), which balances training time, energy usage, and data privacy. Each device tailors its client-side model to its non-i.i.d. data, while a common server-side model provides robustness through generalization. Simulation results validate the framework's effectiveness in achieving optimal utility and addressing decentralized learning challenges. However, the work does not address the communication overhead caused by transmitting the forward-propagation results at each local step. The reference [6] provides a convergence analysis for Sequential Split Learning (SSL), a variant of SL in which training is conducted sequentially, one client after another, on heterogeneous data. It compares SSL with Federated Averaging (FedAvg), showing SSL's superiority on extremely heterogeneous data; in practice, however, FedAvg outperforms SSL when the data heterogeneity is mild. Moreover, SSL still suffers from large communication overheads between the server and the clients.
- SplitFed learning: The reference [7] presents AdaSFL, a method designed to improve model training efficiency by controlling the local update frequency and batch size. Its theoretical analysis establishes convergence rates, which enable an adaptive algorithm for adjusting the update frequency and batch sizes of heterogeneous workers. However, clients must obtain back-propagation results from the server at each local update. Meanwhile, [8] recommends updating the client- and server-side models concurrently, using local-loss-based training with auxiliary networks designed specifically for split learning. This parallel training approach effectively reduces latency and eliminates the need for server-to-client communication; the paper also includes a latency analysis for optimal model partitioning and offers guidelines for model splitting. In particular, [4] developed a communication- and storage-efficient SFL approach (illustrated after Algorithm 1 in Section 3) in which each client trains a portion of the model and computes its local loss using an auxiliary network, thereby reducing the communication overhead. Furthermore, the server model is trained on the sequence of forward-propagation results from the clients, so only one copy of the server model is maintained at any given time. The framework of [8] is similar, with the key difference that each client possesses its own separate server model, and these models are aggregated to construct the global server model.
- Auxiliary networks: Neural network training with back-propagation is hindered by the update-locking issue, where layers must wait for signals to propagate completely through the network before updating [9]. To address this, [9] proposed Decoupled Greedy Learning (DGL), a simpler training approach that greedily relaxes the joint training objective and is highly effective for CNNs in large-scale image classification. DGL optimizes the training objective using auxiliary modules or replay buffers to reduce the communication delays caused by waiting for backward propagation. [10] addressed the backward update-locking constraint by introducing models that decouple modules through predictions of future computations within the network graph: these models use local information to predict the outcomes of subgraphs, in particular error gradients. By using synthetic gradients instead of true back-propagated gradients, subgraphs can update independently and asynchronously, realizing decoupled neural interfaces. A similar approach has been adopted for training in SFL by [4,8], which use an auxiliary model in place of the server model; this research shows that an auxiliary model of relatively small dimension compared to the server model performs sufficiently well as a replacement. A minimal sketch of such a local-loss update with an auxiliary head is given after this list.
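The common mechanism in [4,8,9] is a small auxiliary head that maps the cut-layer activations to a local loss, so the client-side block can update without waiting for gradients from the server-side model. The following PyTorch sketch illustrates this idea only; the module names, dimensions, and the `local_step` helper are illustrative assumptions rather than the architectures used in those papers.

```python
# Minimal sketch of local-loss training with an auxiliary head (illustrative only).
import torch
import torch.nn as nn

class ClientBlock(nn.Module):
    """Client-side portion of the full model, up to the cut layer."""
    def __init__(self, in_dim=784, hidden=128):
        super().__init__()
        self.body = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())

    def forward(self, x):
        return self.body(x)

class AuxiliaryHead(nn.Module):
    """Small auxiliary network that maps cut-layer activations to class logits."""
    def __init__(self, hidden=128, num_classes=10):
        super().__init__()
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, z):
        return self.head(z)

def local_step(client, aux, opt, x, y, loss_fn=nn.CrossEntropyLoss()):
    """One decoupled client update; `opt` optimizes both the client block and its
    auxiliary head, so no server-to-client backward pass is needed."""
    opt.zero_grad()
    z = client(x)              # forward to the cut layer ("smashed data")
    loss = loss_fn(aux(z), y)  # local loss computed through the auxiliary head
    loss.backward()            # gradients stay entirely on the client side
    opt.step()
    return z.detach(), loss.item()  # detached activations would be sent to the server
```

Because the loss is computed locally, the client never waits for the server's backward pass, which is exactly the update-locking relief exploited by DGL and by the SFL schemes of [4,8].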
3. SplitFed Learning Scenario
Algorithm 1: CSE-FSL [4]
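To make the structure of Algorithm 1 concrete, the following is a minimal sketch of one global round in the spirit of CSE-FSL, based on the description in Section 2: each client updates its own block with the auxiliary local loss, the single server-side model is then updated sequentially on the received cut-layer activations, and the client-side models are periodically aggregated. The `fedavg` and `global_round` helpers (and the reuse of `local_step` from the sketch in Section 2) are illustrative assumptions, not the exact procedure of [4].

```python
# Illustrative sketch of one global round in the spirit of CSE-FSL (not the exact
# Algorithm 1 of [4]): clients update locally via auxiliary losses, the single
# server-side model is updated sequentially on the received activations, and the
# client-side models are periodically averaged.
import copy
import torch

def fedavg(models):
    """Average the parameters of the client-side models (FedAvg-style aggregation)."""
    avg = copy.deepcopy(models[0].state_dict())
    for key in avg:
        avg[key] = torch.stack([m.state_dict()[key].float() for m in models]).mean(dim=0)
    return avg

def global_round(clients, aux_heads, client_opts, server_model, server_opt,
                 batches, labels, loss_fn=torch.nn.CrossEntropyLoss()):
    smashed = []
    # 1) Each participating client performs its local update using the auxiliary
    #    local loss (local_step from the sketch in Section 2) and records the
    #    detached cut-layer activations it transmits to the server.
    for k, (client, aux, opt) in enumerate(zip(clients, aux_heads, client_opts)):
        z, _ = local_step(client, aux, opt, batches[k], labels[k])
        smashed.append((z, labels[k]))
    # 2) The single copy of the server-side model is updated sequentially on the
    #    received activations. (The paper's parameter l controls how often this
    #    update occurs; it is shown every round here for simplicity.)
    for z, y in smashed:
        server_opt.zero_grad()
        loss = loss_fn(server_model(z), y)
        loss.backward()
        server_opt.step()
    # 3) The client-side models are aggregated and redistributed.
    avg_state = fedavg(clients)
    for client in clients:
        client.load_state_dict(avg_state)
```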
4. Convergence rate analysis
4.1. Client-Side Model Convergence
4.2. Server-Side Model Convergence
5. Discussion and Conclusions
5.1. Summary of Contributions
- Convergence Analysis: We formulated the CSE-FSL algorithm developed by [4] and conducted a comprehensive convergence rate analysis under both full and partial client participation, for non-i.i.d. datasets and non-convex loss functions. The convergence guarantees are derived under assumptions that are standard in FL convergence analysis, namely L-smoothness of the objective functions, unbiased gradient estimators, and bounded gradient variance (stated generically after this list).
- Key Results:
- Client-Side Model: We demonstrated that, under full client participation, the client-side model converges with a rate of . This result highlights the effectiveness of the algorithm in achieving linear convergence rates while accommodating the constraints of the federated setting and the sequential update of the server model. An increase in l leads to a longer convergence time, which is expected, since a larger l means the server model is updated only after more global rounds.
- Server-Side Model: For the server-side model, we established convergence rates of under both full and partial client participation. This result underscores the robustness of the algorithm in ensuring effective learning even when only a subset of clients participates. It also shows that, in contrast to standard FL settings, the number of clients and their local steps do not speed up the convergence of the server-side model.
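For concreteness, the assumptions referenced above can be stated generically as follows; the notation is standard and the exact constants may differ from those used in the appendix.

```latex
% Standard assumptions for the convergence analysis (generic statement).
\begin{itemize}
  \item \textbf{$L$-smoothness:} each objective $F_k$ satisfies
        $\|\nabla F_k(x) - \nabla F_k(y)\| \le L\,\|x - y\|$ for all $x, y$.
  \item \textbf{Unbiased gradient estimators:} for a mini-batch $\xi$ sampled by client $k$,
        $\mathbb{E}_{\xi}\big[\nabla F_k(x;\xi)\big] = \nabla F_k(x)$.
  \item \textbf{Bounded gradient variance:}
        $\mathbb{E}_{\xi}\big[\|\nabla F_k(x;\xi) - \nabla F_k(x)\|^2\big] \le \sigma^2$.
\end{itemize}
```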
5.2. Implications
Appendix A. Proofs
Appendix A.1. Client-Side Model Convergence
Appendix A.2. Server-Side Model Convergence
References
- McMahan, H.B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-Efficient Learning of Deep Networks from Decentralized Data. arXiv preprint 2023, arXiv:1602.05629.
- Thapa, C.; Chamikara, M.A.P.; Camtepe, S. SplitFed: When Federated Learning Meets Split Learning. arXiv preprint 2020, arXiv:2004.12088.
- Gupta, O.; Raskar, R. Distributed learning of deep neural network over multiple agents. Journal of Network and Computer Applications 2018, 116, 1–8.
- Mu, Y.; Shen, C. Communication and Storage Efficient Federated Split Learning. arXiv preprint 2023, arXiv:2302.05599.
- Kim, M.; DeRieux, A.; Saad, W. A bargaining game for personalized, energy efficient split learning over wireless networks. 2023 IEEE Wireless Communications and Networking Conference (WCNC); IEEE, 2023; pp. 1–6.
- Li, Y.; Lyu, X. Convergence Analysis of Sequential Split Learning on Heterogeneous Data. arXiv preprint 2023, arXiv:2302.01633.
- Liao, Y.; Xu, Y.; Xu, H.; Yao, Z.; Wang, L.; Qiao, C. Accelerating federated learning with data and model parallelism in edge computing. IEEE/ACM Transactions on Networking 2023.
- Han, D.J.; Bhatti, H.I.; Lee, J.; Moon, J. Accelerating federated learning with split learning on locally generated losses. ICML 2021 Workshop on Federated Learning for User Privacy and Data Confidentiality; ICML Board, 2021.
- Belilovsky, E.; Eickenberg, M.; Oyallon, E. Decoupled greedy learning of CNNs. International Conference on Machine Learning; PMLR, 2020; pp. 736–745.
- Jaderberg, M.; Czarnecki, W.M.; Osindero, S.; Vinyals, O.; Graves, A.; Silver, D.; Kavukcuoglu, K. Decoupled neural interfaces using synthetic gradients. International Conference on Machine Learning; PMLR, 2017; pp. 1627–1635.
- Ghadimi, S.; Lan, G. Stochastic first- and zeroth-order methods for nonconvex stochastic programming. SIAM Journal on Optimization 2013, 23, 2341–2368.
- Kairouz, P.; McMahan, H.B.; Avent, B.; Bellet, A.; Bennis, M.; Bhagoji, A.N.; Bonawitz, K.; Charles, Z.; Cormode, G.; Cummings, R.; et al. Advances and open problems in federated learning. Foundations and Trends® in Machine Learning 2021, 14, 1–210.
- Reddi, S.; Charles, Z.; Zaheer, M.; Garrett, Z.; Rush, K.; Konečný, J.; Kumar, S.; McMahan, H.B. Adaptive federated optimization. arXiv preprint 2020, arXiv:2003.00295.
- Reisizadeh, A.; Mokhtari, A.; Hassani, H.; Jadbabaie, A.; Pedarsani, R. FedPAQ: A communication-efficient federated learning method with periodic averaging and quantization. International Conference on Artificial Intelligence and Statistics; PMLR, 2020; pp. 2021–2031.
- Yang, H.; Fang, M.; Liu, J. Achieving linear speedup with partial worker participation in non-IID federated learning. arXiv preprint 2021, arXiv:2101.11203.

Disclaimer/Publisher’s Note: The statements, opinions and data contained in all publications are solely those of the individual author(s) and contributor(s) and not of MDPI and/or the editor(s). MDPI and/or the editor(s) disclaim responsibility for any injury to people or property resulting from any ideas, methods, instructions or products referred to in the content.
© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).