1. Introduction
Optimization plays a crucial role in various domains such as smart homes [1,2], finance [3,4], transportation [5,6], and solar energy systems [7,8,9]. Recently, learning models have shown promising abilities to solve complex optimization problems. In particular, most recent studies in Federated Learning (FL) have focused on designing optimization algorithms with proven convergence guarantees for a finite-sum objective of the form
$$\min_{w}\; f(w) := \frac{1}{N}\sum_{i=1}^{N} F_i(w), \qquad (1)$$
where $N$ is the number of clients in the network and $F_i(w) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i}\left[f(w;\xi_i)\right]$ is the expected prediction loss of client $i$ given the model parameter $w$ and the data distribution $\mathcal{D}_i$ [10]. However, as [11] pointed out, naively minimizing the aggregate loss function may create disparities in model performance (i.e., prediction accuracy) across different clients. For example, it could result in overfitting the model to one client at the cost of another; the prediction accuracy would then be higher for some clients than for others because those clients contributed more data to train the model. To better describe this issue, [11] define the fairness of the performance distribution as the uniformity of model performance across devices: a trained model $w$ is fairer than a model $\tilde{w}$ if the performance of $w$ on the $N$ devices is more uniform than that of $\tilde{w}$.
Inspired by the $\alpha$-fairness function used in resource allocation for wireless networks, [11] propose an optimization objective called q-Fair Federated Learning (q-FFL) that addresses this fairness issue, i.e., the disproportionate model performance across devices. Unlike the objective in (1), q-FFL penalizes the loss functions of the devices with a tunable parameter $q$, so that the model performance across clients in the network can be pushed toward uniformity to any desired extent.
The authors note that with q-FFL as the optimization objective, FedAvg is no longer directly applicable, since the expected loss for client $i$ is newly defined as $\tilde{F}_i(w) = \frac{1}{q+1}F_i^{q+1}(w)$, and thus the local SGD update used in FedAvg (where each client steps along an unbiased stochastic gradient of $F_i$) cannot be reused in the q-FFL setting. However, if we also change the per-sample loss from $f(w;\xi_i)$ to a surrogate $\tilde{f}(w;\xi_i)$ and perform the local update $w^i_{t+1} = w^i_t - s\,\nabla \tilde{f}(w^i_t;\xi^i_t)$, the new objective can still fit into the FedAvg framework. (Notice that the natural candidate $\tilde{f}(w;\xi_i) = \frac{1}{q+1}f(w;\xi_i)^{q+1}$ does not satisfy $\mathbb{E}_{\xi_i}[\tilde{f}(w;\xi_i)] = \tilde{F}_i(w)$ in general, so we need to find a proper $\tilde{f}$ that satisfies this equality to make it work.)
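A quick numerical illustration of this point (our own sketch, not part of [11]): for a non-constant per-sample loss, the expectation of the $(q+1)$-th power differs from the $(q+1)$-th power of the expectation, so the naive per-sample surrogate is biased.

```python
import numpy as np

# Hypothetical per-sample losses on one client (any non-degenerate distribution works).
rng = np.random.default_rng(0)
losses = rng.uniform(0.1, 2.0, size=10_000)

q = 2
lhs = np.mean(losses ** (q + 1))        # E[ f(w; xi)^(q+1) ]
rhs = np.mean(losses) ** (q + 1)        # ( E[ f(w; xi) ] )^(q+1)

# By Jensen's inequality, lhs >= rhs, with equality only for constant losses.
print(f"E[f^(q+1)]   = {lhs:.4f}")
print(f"(E[f])^(q+1) = {rhs:.4f}")
```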
Assume we can find such a suitable surrogate; then we are able to prove that the convergence rate of FedAvg on q-FFL is comparable to that of the same algorithm on the original finite-sum objective. Notice that, under standard assumptions on the objective function, [12] have shown that FedAvg for non-convex optimization can achieve an $O(1/\sqrt{NT})$ convergence rate. In this work, we show that the q-Fair Federated Learning objective with a FedAvg-like algorithm can maintain this convergence rate using almost the same assumptions on the original client losses $F_i$ (instead of assumptions on the q-FFL losses $\tilde{F}_i$).
2. q-Fair Federated Learning and Its Performance Analysis
For non-negative cost functions $F_i$ and parameter $q \ge 0$, the objective of q-FFL is defined as:
$$\min_{w}\; f_q(w) := \frac{1}{N}\sum_{i=1}^{N} \frac{1}{q+1}\, F_i^{q+1}(w), \qquad (2)$$
where $F_i^{q+1}(w)$ denotes $F_i(w)$ raised to the power of $q+1$ and $q$ is the tunable fairness parameter. Notice that, similar to the $\alpha$-fairness framework, when $q = 0$ no fairness is imposed and problem (2) reduces to problem (1). When $q \to \infty$, it corresponds to max-min fairness.
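To make the role of $q$ tangible, here is a small sketch (ours, not from [11]) that evaluates the objective in (2) for a fixed vector of client losses and several values of $q$; larger $q$ weights the worst-off clients more heavily.

```python
import numpy as np

def q_ffl_objective(client_losses, q):
    """q-FFL objective (2): average of F_i^(q+1) / (q+1) over the clients."""
    client_losses = np.asarray(client_losses, dtype=float)
    return np.mean(client_losses ** (q + 1) / (q + 1))

# Hypothetical expected losses F_i(w) of N = 4 clients for some fixed model w.
losses = [0.2, 0.3, 0.4, 1.5]

for q in [0.0, 1.0, 5.0]:
    print(f"q = {q}: f_q(w) = {q_ffl_objective(losses, q):.4f}")
# As q grows, the client with loss 1.5 dominates the objective, so minimizing
# f_q pushes the performance distribution across clients toward uniformity.
```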
Table 1. Notation summary.

| Symbol | Meaning |
|---|---|
| $q$ | Fairness parameter |
| $N$ | The number of clients |
| $\mathcal{D}_i$ | Set of data at client $i$ |
| $\xi_i$ | Example drawn from $\mathcal{D}_i$ |
| $f(w;\xi_i)$ | Loss on example $\xi_i$ at client $i$ with model parameter $w$ |
| $F_i(w)$ | Expected loss at client $i$ with model parameter $w$, i.e., $F_i(w) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[f(w;\xi_i)]$ |
| $\tilde{F}_i(w)$ | Expected q-FFL loss at client $i$ with model parameter $w$, i.e., $\tilde{F}_i(w) = \frac{1}{q+1}F_i^{q+1}(w)$ |
| $f(w)$ | Objective function of federated learning, i.e., $f(w) = \frac{1}{N}\sum_{i=1}^{N} F_i(w)$ |
| $f_q(w)$ | Objective function of q-fair federated learning, i.e., $f_q(w) = \frac{1}{N}\sum_{i=1}^{N} \tilde{F}_i(w)$ |
| $g_i(w;\xi_i)$ | Stochastic gradient of $\tilde{F}_i$ at $w$ computed with example $\xi_i$, i.e., $\mathbb{E}_{\xi_i}[g_i(w;\xi_i)] = \nabla \tilde{F}_i(w)$ |
Table 1 summarizes our notation. Throughout this paper, we assume problem (2) satisfies the following assumption.
Assumption 1. We assume that problem (2) satisfies:
Smoothness: Each $F_i$ is smooth with modulus $L$, i.e., for any $w, w'$, $\|\nabla F_i(w) - \nabla F_i(w')\| \le L\|w - w'\|$.
Bounded variances and second moments: The stochastic gradient has bounded variance and second moments: there exist constants $\sigma > 0$ and $G > 0$ such that $\mathbb{E}\|\nabla f(w;\xi_i) - \nabla F_i(w)\|^2 \le \sigma^2$ and $\mathbb{E}\|\nabla f(w;\xi_i)\|^2 \le G^2$ for all $i$ and $w$.
Bounded loss function: There exist constants $0 \le m \le M$ such that $m \le F_i(w) \le M$ for all $i$ and all $w \in \mathcal{W}$, where $\mathcal{W}$ is a non-empty compact set.
We assume the data are independent and identically distributed (IID). Each worker can compute unbiased stochastic gradients on the last iterate $w_t^i$ and a data sample $\xi_t^i$, given by $\nabla f(w_t^i;\xi_t^i)$. For simplicity, we denote the expected loss for device $i$ by $F_i(w)$, i.e., $F_i(w) = \mathbb{E}_{\xi_i \sim \mathcal{D}_i}[f(w;\xi_i)]$, and the expectation of the stochastic gradient is $\mathbb{E}\left[\nabla f(w_t^i;\xi_t^i) \mid \mathcal{F}_t\right] = \nabla F_i(w_t^i)$, where $\mathcal{F}_t$ denotes all the random samples used to calculate stochastic gradients up to iteration $t$.
The FedAvg-like algorithm with the q-FFL objective is described in Algorithm 1.
Algorithm 1: FedAvg-like algorithm with q-Fairness. Here $i$ is the client index, $K$ is the number of local epochs, and $s$ is the learning rate.
Algorithm 1 is essentially the same as FedAvg described in [13], except that the stochastic gradient $g_i(w_t^i;\xi_t^i)$ is computed on $\tilde{F}_i$, the expected loss for the newly defined q-FFL objective.
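For concreteness, below is a minimal sketch of such a FedAvg-like round, assuming a toy quadratic per-sample loss and the per-sample surrogate gradient $f(w;\xi)^q\,\nabla f(w;\xi)$ for the local step. The surrogate choice, the model, and all function names are our own illustration rather than the authors' exact algorithm.

```python
import numpy as np

def local_update(w, data, q, s, K, rng):
    """K local SGD steps on a per-sample q-FFL surrogate (illustrative choice)."""
    w = w.copy()
    for _ in range(K):
        x, y = data[rng.integers(len(data))]     # draw one example xi from the client's data
        residual = float(w @ x - y)
        loss = 0.5 * residual ** 2               # toy per-sample loss f(w; xi)
        grad = residual * x                      # gradient of f(w; xi) w.r.t. w
        w -= s * (loss ** q) * grad              # surrogate gradient f^q * grad f
    return w

def fedavg_qffl_round(w_global, clients, q, s, K, rng):
    """One communication round: each client runs K local steps, the server averages."""
    local_models = [local_update(w_global, data, q, s, K, rng) for data in clients]
    return np.mean(local_models, axis=0)         # uniform averaging over the N clients

# Tiny synthetic setup: N = 3 clients, 1-D linear model, targets y ≈ 2x.
rng = np.random.default_rng(0)
clients = [[(np.array([x]), 2.0 * x + 0.1 * rng.normal())
            for x in rng.uniform(-1.0, 1.0, 50)] for _ in range(3)]
w = np.zeros(1)
for _ in range(50):
    w = fedavg_qffl_round(w, clients, q=1.0, s=0.2, K=5, rng=rng)
print("learned parameter:", w)
```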
Fixing the iteration index $t$, we define $\bar{w}_t := \frac{1}{N}\sum_{i=1}^{N} w_t^i$ as the average of the local solutions $w_t^i$ over all $N$ nodes. It is immediate that
$$\bar{w}_{t+1} = \bar{w}_t - \frac{s}{N}\sum_{i=1}^{N} g_i(w_t^i;\xi_t^i).$$
The proof starts from the descent lemma and the $L$-smoothness assumption. We then bound the resulting quadratic term and cross term using the same technique as [12]. Lastly, through a telescoping sum, we obtain an upper bound on the (average) expected squared gradient norm. The following useful lemma relates the client drift $\frac{1}{N}\sum_{i=1}^{N}\|\bar{w}_t - w_t^i\|^2$ to the node synchronization interval $K$.
Lemma 1 (Client drift). Under Assumption 1, the algorithm ensures
where $s$ is the constant stepsize, the remaining constants are those defined in Assumption 1, and $q$ is the fairness parameter.
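The following toy simulation (our own, with a hypothetical quadratic client objective) illustrates the qualitative content of Lemma 1: the average drift of the local iterates away from their mean grows with the synchronization interval $K$.

```python
import numpy as np

def average_client_drift(N=10, K=5, s=0.05, steps=200, seed=0):
    """Average squared deviation of local iterates from their mean when N clients
    run local SGD on the same (IID) toy objective 0.5*(w - xi)^2 and synchronize
    every K steps."""
    rng = np.random.default_rng(seed)
    w = np.zeros(N)                                # all clients start from the synced model
    drift = []
    for t in range(steps):
        xi = rng.normal(1.0, 1.0, size=N)          # one fresh sample per client
        w -= s * (w - xi)                          # local SGD step on 0.5*(w - xi)^2
        drift.append(np.mean((w - w.mean()) ** 2)) # client drift at iteration t
        if (t + 1) % K == 0:
            w[:] = w.mean()                        # synchronization (model averaging)
    return float(np.mean(drift))

for K in (1, 5, 20):
    print(f"K = {K:2d}: average client drift = {average_client_drift(K=K):.5f}")
```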
Theorem 1. Under Assumption 1, if the stepsize $s$ is chosen suitably in the Algorithm, then for all $T \ge 1$ we have
where $f_q^{*}$ is the minimum value of problem (2).
Proof. From $L$-smoothness and the descent lemma:
Substituting the bounds above into this inequality yields
Plugging the result back in and choosing the stepsize as stated, we have
Dividing both sides by the leading coefficient and rearranging terms yields
Summing over $t = 0, \dots, T-1$ and dividing both sides by $T$ yields the claim. □
The next corollary follows by substituting suitable values into Theorem 1.
Corollary 1. Consider problem (2) under Assumption 1.
1. If we choose the stepsize accordingly in the Algorithm, then the bound of Theorem 1 applies directly.
2. If we further tune the stepsize, we obtain the stated rate, where $f_q^{*}$ is the minimum value of problem (2).
With proper assumptions, FedAvg can thus achieve the same convergence rate on the q-fair federated learning objective as on the original objective.
3. Fair Incentive Mechanism Design Through Stackelberg Game Settings
Most studies in Federated Learning (FL) have focused on the convergence rate and stationarity gap of optimization algorithms. However, designing mechanisms that motivate local devices (called clients in FL) to collaborate in global model training has not received enough attention. [14] devised an incentive mechanism in a Stackelberg game setting and proved the existence of a unique Nash equilibrium.
In the first stage of the Stackelberg game, the parameter server (the leader) strives to convince the clients (the followers) to participate in training a global model by declaring a total payment. Next, in the second stage, the participants decide how much local data they are going to train on. [14] proved that for any positive total payment, the second-stage game has a unique Nash equilibrium and, furthermore, that there exists a unique Stackelberg equilibrium for the game. The edge nodes also incur two costs, a communication cost and a computation cost, both of which depend on the size of their training data sets.
Based on the works of [14] and [15], we add a fairness coefficient to the utility functions of the clients in a federated-learning incentive mechanism, with the hope that this additive term leads to a fairer allocation of resources to the clients. [15] argues that the global fairness of a model should consider the full data set across all clients, whereas in local fairness measurement the data sets are typically non-IID. To address this issue, they propose a global and a local fairness metric. The global metric (equal opportunity difference) can be written as
$$F_{\text{global}} = \Pr\big(\hat{Y}=1 \mid A=1, Y=1\big) - \Pr\big(\hat{Y}=1 \mid A=0, Y=1\big),$$
where $\hat{Y}$ is the prediction of the trained classifier and $A$ is a group attribute, say, a gender group. The ideal value of this metric is zero, meaning that the true positive rate should be equal regardless of gender, male or female. The local fairness metric is defined analogously on each client's own data,
$$F_{k} = \Pr\big(\hat{Y}=1 \mid A=1, Y=1, C=k\big) - \Pr\big(\hat{Y}=1 \mid A=0, Y=1, C=k\big),$$
where $C=k$ denotes the $k$-th client, so that only that client's data set and distribution are considered [15]. Furthermore, in FedAvg [13], the conventional global update uses data-size weights, i.e.,
$$w_{t+1} = \sum_{k=1}^{N} \frac{n_k}{n}\, w_{t+1}^{k},$$
where $n_k$ is the number of samples at client $k$ and $n = \sum_k n_k$. In particular, [15] discusses that these naive aggregation weights disfavour clients with smaller data sets, and they propose a modified version of the aggregation weights to achieve global fairness. We adopted this approach and integrated it into the work of [14]. The notation and the proof of the existence of a Nash equilibrium in this modified setting are presented in what follows.
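As a small illustration (ours, with hypothetical variable names), the equal opportunity difference above can be computed from predictions, labels, and group membership as follows.

```python
import numpy as np

def equal_opportunity_difference(y_true, y_pred, group):
    """Difference in true positive rates between group == 1 and group == 0."""
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tpr = {}
    for g in (0, 1):
        positives = (y_true == 1) & (group == g)   # qualified members of group g
        tpr[g] = y_pred[positives].mean()          # fraction of them predicted positive
    return tpr[1] - tpr[0]                         # ideal value is 0

# Toy data: labels, binary predictions, and a binary group attribute (e.g., gender).
y_true = [1, 1, 1, 1, 0, 0, 1, 1]
y_pred = [1, 0, 1, 1, 0, 1, 1, 0]
group  = [0, 0, 1, 1, 0, 1, 1, 1]
print("EOD:", equal_opportunity_difference(y_true, y_pred, group))
```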
The utility of each client is modified by adding a fairness term to it. To maximize each client's utility, we take the derivative of the utility function and set it equal to zero. To ensure that the solutions of the resulting equation are not saddle points, we also take the second derivative with respect to the client's contribution; it is negative, which means that the utility function is strictly concave and the solutions are globally optimal. However, there is a limitation on each client's contribution: it cannot exceed the size of that client's data set, and it also depends on the declared payment. From this, one can derive the best response of each follower (client) independently of the other clients. Summing both sides of the resulting equations over clients $1$ to $M$ then characterizes the equilibrium contributions.
So far, we have found the best responses of the followers. At this point, the leader knows that, whatever the payment is, there exists a unique equilibrium; the leader therefore strives to extract the maximum profit by optimally tuning the total payment, where the parameter server's utility function is defined as in [14]. Since this utility is a concave function of the payment and its value equals zero when the payment is zero, it has a unique maximizer, and the proof is done.
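To make the two-stage structure concrete, here is a small numerical sketch with an entirely hypothetical utility form (a proportional share of the payment minus a linear training cost). It is not the utility function of [14]; it only illustrates how the followers' best responses and the leader's payment choice interact.

```python
import numpy as np

def follower_best_response(R, costs, d_max, grid=500, iters=100):
    """Each client i picks a data amount x_i in (0, d_max_i] maximizing a hypothetical
    utility: its proportional share of payment R minus a linear per-unit cost."""
    x = np.full(len(costs), 1.0)                 # start from a positive contribution
    for _ in range(iters):                       # best-response dynamics until (near) fixed point
        for i, (c, dm) in enumerate(zip(costs, d_max)):
            cand = np.linspace(1e-6, dm, grid)
            others = x.sum() - x[i]
            util = R * cand / (cand + others) - c * cand
            x[i] = cand[np.argmax(util)]
    return x

# Hypothetical per-unit costs and data caps for M = 3 clients.
costs, d_max = np.array([0.5, 0.8, 1.0]), np.array([10.0, 10.0, 10.0])

# Leader: pick the payment R maximizing a (hypothetical) profit = value of data - payment.
best = max(
    ((R, 2.0 * np.log1p(follower_best_response(R, costs, d_max).sum()) - R)
     for R in np.linspace(0.5, 10.0, 20)),
    key=lambda t: t[1],
)
print(f"leader's payment R = {best[0]:.2f}, leader profit = {best[1]:.3f}")
```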
4. Summary
In this project, we studied the fairness issue in federated learning. Two different notions of fairness were examined: fair (uniform) model performance across local clients and fair incentive mechanism. To achieve the first type of fairness, a novel optimization objective q-FFL was introduced. We tried to fit this federated learning task into the FedAvg framework and derived the convergence rate.
At first, we thought we had proven it by following a procedure similar to previous literature. However, after a careful examination of the new objective function, we realized that some parts of the proof procedure of [12] cannot be used in this case due to the change of objective. For example, we do not know an unbiased stochastic gradient of the q-FFL client loss $\tilde{F}_i$ when only knowing stochastic gradients of $F_i$. Therefore, we could not use SGD for the local update as in FedAvg. Either we need more assumptions, or we need to find a suitable structure for the surrogate loss $\tilde{f}$ to make Theorem 1 hold.
As for fairness in the incentive mechanism, we modeled the training process as a Stackelberg game where the server is treated as the leader and the clients are followers. Once the server announces a payment for training the global model, the followers decide the amount of data (and effort) they would like to contribute to the training task so as to maximize their utilities. We have proven that the system admits a unique Stackelberg equilibrium.
Appendix A
This part also gives an upper bound that is looser than the bounds proved above. The techniques used in this proof could be applied whenever one wants to bound each loss function differently. In [11], the Lipschitz coefficient is local, defined separately at every point $x$, and the step size is also allowed to differ for every local agent. Considering either such a dynamic Lipschitz coefficient or such a dynamic step size makes it very difficult to simplify all the terms above. What we assume instead is a fixed stepsize $s$ and a Lipschitz constant defined at a fixed point.
Assumption 2. Assume that the loss functions are uniformly bounded. (This could be generalized to a specific bound on each loss function.)
Due to Assumption 2, the client drift now simplifies as follows:
Previously, we showed the per-iteration descent inequality. We first simplify the quadratic term: under the stated stepsize condition its negative part can be dropped, which yields an upper bound. To simplify the cross term, we use the client-drift bound together with Assumption 2. Rewriting the descent inequality with these two bounds gives an upper bound on the expected squared gradient norm at iteration $t$. By iterating over $t$ and dividing both sides by $T$, we obtain the (looser) upper bound. □
References
- Nematirad, R.; Ardehali, M.; Khorsandi, A.; Mahmoudian, A. Optimization of Residential Demand Response Program Cost with Consideration for Occupants Thermal Comfort and Privacy. IEEE Access 2024. [CrossRef]
- Talebi, A. A multi-objective mixed integer linear programming model for supply chain planning of 3D printing. 2024. arXiv:2408.05213.
- Varmaz, A.; Fieberg, C.; Poddig, T. Portfolio optimization for sustainable investments. Annals of Operations Research 2024, pp. 1–26.
- Talebi, A.; Haeri Boroujeni, S.P.; Razi, A. Integrating random regret minimization-based discrete choice models with mixed integer linear programming for revenue optimization. Iran Journal of Computer Science 2024, pp. 1–15.
- Archetti, C.; Peirano, L.; Speranza, M.G. Optimization in multimodal freight transportation problems: A Survey. European Journal of Operational Research 2022, 299, 1–20. [CrossRef]
- Talebi, A. Simulation in discrete choice models evaluation: SDCM, a simulation tool for performance evaluation of DCMs. 2024. arXiv:2407.17014.
- Nematirad, R.; Pahwa, A.; Natarajan, B. A Novel Statistical Framework for Optimal Sizing of Grid-Connected Photovoltaic–Battery Systems for Peak Demand Reduction to Flatten Daily Load Profiles. Solar. MDPI, 2024, Vol. 4, pp. 179–208.
- Nematirad, R.; Pahwa, A.; Natarajan, B.; Wu, H. Optimal sizing of photovoltaic-battery system for peak demand reduction using statistical models. Frontiers in Energy Research 2023, 11, 1297356. [CrossRef]
- Soleymani, S.; Talebi, A. Forecasting solar irradiance with geographical considerations: integrating feature selection and learning algorithms. Asian Journal of Social Science 2024, 8, 5.
- Talebi, A. Convergence Rate Analysis of Non-I.I.D. SplitFed Learning with Partial Worker Participation and Auxiliary Networks. Preprints 2024. [CrossRef]
- Li, T.; Sanjabi, M.; Beirami, A.; Smith, V. Fair Resource Allocation in Federated Learning. International Conference on Learning Representations, 2020.
- Yu, H.; Yang, S.; Zhu, S. Parallel Restarted SGD with Faster Convergence and Less Communication: Demystifying Why Model Averaging Works for Deep Learning. Proceedings of the AAAI Conference on Artificial Intelligence 2019, 33, 5693–5700. [CrossRef]
- McMahan, B.; Moore, E.; Ramage, D.; Hampson, S.; y Arcas, B.A. Communication-efficient learning of deep networks from decentralized data. Artificial intelligence and statistics. PMLR, 2017, pp. 1273–1282.
- Zhan, Y.; Li, P.; Qu, Z.; Zeng, D.; Guo, S. A Learning-Based Incentive Mechanism for Federated Learning. IEEE Internet of Things Journal 2020, 7, 6360–6368. [CrossRef]
- Ezzeldin, Y.H.; Yan, S.; He, C.; Ferrara, E.; Avestimehr, S. Fairfed: Enabling group fairness in federated learning. 2021. arXiv:2110.00857.
- Karimireddy, S.P.; Kale, S.; Mohri, M.; Reddi, S.; Stich, S.; Suresh, A.T. Scaffold: Stochastic controlled averaging for federated learning. International Conference on Machine Learning. PMLR, 2020, pp. 5132–5143.