
Federated Learning with Differential Privacy:


Algorithms and Performance Analysis
Kang Wei, Student Member, IEEE, Jun Li, Senior Member, IEEE, Ming Ding, Senior Member, IEEE,
Chuan Ma, Howard H. Yang, Member, IEEE, Farhad Farokhi, Senior Member, IEEE,
Shi Jin, Senior Member, IEEE, Tony Q. S. Quek, Fellow, IEEE, H. Vincent Poor, Fellow, IEEE

Abstract—Federated learning (FL), as a type of distributed machine learning, is capable of significantly preserving clients' private data from being exposed to adversaries. Nevertheless, private information can still be divulged by analyzing uploaded parameters from clients, e.g., weights trained in deep neural networks. In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of differential privacy (DP), in which artificial noise is added to parameters at the clients' side before aggregation, namely, noising before model aggregation FL (NbAFL). First, we prove that the NbAFL can satisfy DP under distinct protection levels by properly adapting the variances of the artificial noise. Then we develop a theoretical convergence bound on the loss function of the trained FL model in the NbAFL. Specifically, the theoretical bound reveals the following three key properties: 1) There is a tradeoff between convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) Given a fixed privacy protection level, increasing the number N of overall clients participating in FL can improve the convergence performance; and 3) There is an optimal number of aggregation times (communication rounds) in terms of convergence performance for a given protection level. Furthermore, we propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound for the loss function in this case, and the K-client random scheduling strategy also retains the above three properties. Moreover, we find that there is an optimal K that achieves the best convergence performance at a fixed privacy level. Evaluations demonstrate that our theoretical results are consistent with simulations, thereby facilitating the design of various privacy-preserving FL algorithms with different tradeoff requirements on convergence performance and privacy levels.

Index Terms—Federated learning, differential privacy, convergence performance, information leakage, client selection

This work is supported in part by the National Key Research and Development Program under Grant 2018YFB1004800, the National Natural Science Foundation of China (Grant No. 61872184 and Grant No. 61727802), the SUTD Growth Plan Grant for AI, the U.S. National Science Foundation under Grant CCF-1908308, and the Princeton Center for Statistics and Machine Learning under a Data X Grant. Corresponding authors: Jun Li and Chuan Ma (e-mail: {jun.li, chuan.ma}@njust.edu.cn).
Jun Li is with the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, 210094, China. He is also with the School of Computer Science and Robotics, National Research Tomsk Polytechnic University, Tomsk, 634050, Russia. Email: jun.li@njust.edu.cn.
Chuan Ma and Kang Wei are with the School of Electrical and Optical Engineering, Nanjing University of Science and Technology, Nanjing 210094, China (e-mail: {kang.wei, chuan.ma}@njust.edu.cn).
Ming Ding is with Data61, CSIRO, Sydney, NSW 2015, Australia (e-mail: ming.ding@data61.csiro.au).
Farhad Farokhi is with the Department of Electrical and Electronic Engineering, the University of Melbourne, Melbourne, VIC 3010, Australia. When working on this paper, he was also affiliated with CSIRO's Data61, Australia (e-mail: ffarokhi@unimelb.edu.au).
Howard H. Yang and Tony Q. S. Quek are with the Information System Technology and Design Pillar, Singapore University of Technology and Design, Singapore (e-mail: {howard yang, tonyquek}@sutd.edu.sg).
Shi Jin is with the National Mobile Communications Research Laboratory, Southeast University, Nanjing 210096, China (e-mail: jinshi@seu.edu.cn).
H. Vincent Poor is with the Department of Electrical Engineering, Princeton University, Princeton, NJ 08544 USA (e-mail: poor@princeton.edu).
Digital Object Identifier: 10.1109/TIFS.2020.2988575

I. INTRODUCTION

With AlphaGo's success, it is expected that big data-driven artificial intelligence (AI) will soon be applied in many aspects of our daily life, including medical care, agriculture, transportation systems, etc. At the same time, the rapid growth of Internet-of-Things (IoT) applications calls for data mining and learning securely and reliably in distributed systems [1]–[3]. When integrating AI in a variety of IoT applications, distributed machine learning (ML) is preferred for many data processing tasks by defining parametrized functions from inputs to outputs as compositions of basic building blocks [4], [5]. Federated learning (FL), as a recent advance in distributed ML, was proposed, in which data are acquired and processed locally at the client side, and then the updated ML parameters are transmitted to a central server for aggregation [6]–[8]. The goal of FL is to fit a model generated by an empirical risk minimization (ERM) objective. However, FL also poses several key challenges, such as private information leakage, expensive communication costs between servers and clients, and device variability [9]–[15].

Generally, distributed stochastic gradient descent (SGD) is adopted in FL for training ML models. In [16], [17], bounds for FL convergence performance were developed based on distributed SGD, with a one-step local update before global aggregation. The work in [18] considered partially global aggregation, where after each local update step, parameter aggregation is performed over a non-empty subset of the client set. In order to analyze the convergence, federated proximal (FedProx) was proposed [19] by adding regularization to each local loss function. The work in [20] obtained a convergence bound for SGD-based FL that incorporates non-independent-and-identically-distributed (non-i.i.d.) data distributions among clients.

At the same time, with the ever increasing awareness of the data security of personal information, privacy preservation has become a worldwide and significant issue, especially for big data applications and distributed learning systems. One prominent advantage of FL is that it enables local training without personal data exchange between the server and clients, thereby protecting clients' data from being eavesdropped upon by hidden adversaries.

Nevertheless, private information can still be divulged to some extent by analyzing the differences of the parameters trained and uploaded by the clients, e.g., weights trained in neural networks [21]–[24].

A natural approach to preventing information leakage is to add artificial noise, known as differential privacy (DP) techniques [25], [26]. Existing works on DP-based learning algorithms include local DP (LDP) [27]–[29], DP-based distributed SGD [30], [31] and DP meta learning [32]. In LDP, each client perturbs its information locally and only sends a randomized version to a server, thereby protecting both the clients and the server against private information leakage. The work in [28] proposed solutions for building an LDP-compliant SGD, which powers a variety of important ML tasks. The work in [29] considered distributed estimation at the server over uploaded data from clients while providing protections on these data with LDP. The work in [33] introduced an algorithm for user-level differentially private training of large neural networks, in particular a complex sequence model for next-word prediction. The work in [34] developed a chain abstraction model on tensors to efficiently override operations (or encode new ones) such as sending/sharing a tensor between workers, and then provided the elements to implement recently proposed DP and multiparty computation protocols using this framework. The work in [30] improved the computational efficiency of DP-based SGD by tracking detailed information about the privacy loss, and obtained accurate estimates of the overall privacy loss. The work in [31] proposed novel DP-based SGD algorithms and analyzed their performance bounds, which were shown to be related to privacy levels and the sizes of datasets. Also, the work in [32] focused on the class of gradient-based parameter-transfer methods and developed a DP-based meta learning algorithm that not only satisfies the privacy requirement but also retains provable learning performance in convex settings.

More specifically, DP-based FL approaches are usually devoted to capturing the tradeoff between privacy and convergence performance in the training process. The work in [35] proposed an FL algorithm with consideration for preserving clients' privacy. This algorithm can achieve good training performance at a given privacy level, especially when there is a sufficiently large number of participating clients. The work in [36] presented an alternative approach that utilizes both DP and secure multiparty computation (SMC) to prevent differential attacks. However, the above two works on DP-based FL design have not taken into account privacy protection during the parameter uploading stage, i.e., the clients' private information can be potentially intercepted by hidden adversaries when uploading the training results to the server. Moreover, these two works only showed empirical results using simulations, but lacked theoretical analysis of the FL system, such as the tradeoffs between privacy, convergence performance, and convergence rate. To the authors' knowledge, a theoretical analysis of the convergence behavior of FL with privacy-preserving noise perturbations has not yet been considered in existing studies, which will be the major focus of this work. Compared with conventional works, such as [35], [36], which focus mainly on simulation results, our theoretical performance analysis is more efficient for finding the optimal parameters, e.g., the number of chosen clients K and the number of maximum aggregation times T, to achieve the minimum loss function.

In this paper, to effectively prevent information leakage, we propose a novel framework based on the concept of DP, in which each client perturbs its trained parameters locally by purposely adding noise before uploading them to the server for aggregation, namely, noising before model aggregation FL (NbAFL). To the best of the authors' knowledge, this is the first piece of work of its kind that theoretically analyzes the convergence properties of differentially private FL algorithms. The main contributions of this paper are summarized as follows:

• We prove that the proposed NbAFL scheme satisfies the requirement of DP in terms of global data under a certain noise perturbation level with Gaussian noises by properly adapting their variances.
• We develop a convergence bound on the loss function of the trained FL model in the NbAFL with artificial Gaussian noises. Our developed bound reveals the following three key properties: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., a better convergence performance leads to a lower protection level; 2) Increasing the number N of overall clients participating in FL can improve the convergence performance, given a fixed privacy protection level; and 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level.
• We propose a K-client random scheduling strategy, where K (1 ≤ K < N) clients are randomly selected from the N overall clients to participate in each aggregation. We also develop a corresponding convergence bound on the loss function in this case. From our analysis, the K-client random scheduling strategy retains the above three properties. Also, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level.
• We conduct extensive simulations based on real-world datasets to validate the properties of our theoretical bound in NbAFL. Evaluations demonstrate that our theoretical results are consistent with simulations. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.

The remainder of this paper is organized as follows. In Section II, we introduce background on FL, DP and a conventional DP-based FL algorithm. In Section III, we detail the proposed NbAFL and analyze the privacy performance based on DP. In Section IV, we analyze the convergence bound of NbAFL and reveal the relationship between privacy levels, convergence performance, the number of clients, and the number of global aggregations. In Section V, we propose the K-client random scheduling scheme and develop the convergence bound. We show the analytical results and simulations in Section VI. We conclude the paper in Section VII. A summary of basic concepts and notations is provided in Tab. I.
TABLE I: Summary of Main Notation

M           A randomized mechanism for DP
D, D'       Adjacent databases
ǫ, δ        The parameters related to DP
C_i         The i-th client
D_i         The database held by the owner C_i
D           The database held by all the clients
|·|         The cardinality of a set
N           Total number of all clients
K           The number of chosen clients (1 < K < N)
t           The index of the t-th aggregation
T           The number of aggregation times
w           The vector of model parameters
F(w)        Global loss function
F_i(w)      Local loss function of the i-th client
µ           A preset constant of the proximal term
w_i^(t)     Local uploaded parameters of the i-th client
w^(0)       Initial parameters of the global model
w^(t)       Global parameters generated from all local parameters at the t-th aggregation
v^(t)       Global parameters generated from K clients' parameters at the t-th aggregation
w*          True optimal model parameters that minimize F(w)

II. PRELIMINARIES

In this section, we will present preliminaries and related background knowledge on FL and DP. Also, we introduce the threat model that will be discussed in our following analysis.

A. Federated Learning

Let us consider a general FL system consisting of one server and N clients, as depicted in Fig. 1. Let D_i denote the local database held by the client C_i, where i ∈ {1, 2, ..., N}. At the server, the goal is to learn a model over the data that resides at the N associated clients. An active client, participating in the local training, needs to find a vector w of an AI model that minimizes a certain loss function. Formally, the server aggregates the weights received from the N clients as

    w = \sum_{i=1}^{N} p_i w_i,    (1)

where w_i is the parameter vector trained at the i-th client, w is the parameter vector after aggregation at the server, N is the number of clients, p_i = |D_i|/|D| ≥ 0 with \sum_{i=1}^{N} p_i = 1, and |D| = \sum_{i=1}^{N} |D_i| is the total size of all data samples. Such an optimization problem can be formulated as

    w^* = \arg\min_{w} \sum_{i=1}^{N} p_i F_i(w, D_i),    (2)

where F_i(·) is the local loss function of the i-th client. Generally, the local loss function F_i(·) is given by local empirical risks. The training process of such an FL system usually contains the following four steps:
• Step 1: Local training: All active clients locally compute training gradients or parameters and send the locally trained ML parameters to the server;
• Step 2: Model aggregating: The server performs secure aggregation over the uploaded parameters from the N clients without learning local information;
• Step 3: Parameters broadcasting: The server broadcasts the aggregated parameters to the N clients;
• Step 4: Model updating: All clients update their respective models with the aggregated parameters and test the performance of the updated models.

In the FL process, the N clients with the same data structure collaboratively learn an ML model with the help of a cloud server. After a sufficient number of local training and update exchanges between the server and its associated clients, the solution to the optimization problem (2) is able to converge to that of the global optimal learning model.
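The weighted aggregation in (1) can be written compactly in a few lines of code. The following sketch is illustrative only and is not part of the original paper; the function and variable names (e.g., aggregate, local_params) are hypothetical, and it assumes each client's update is already available as a NumPy vector.

```python
import numpy as np

def aggregate(local_params, dataset_sizes):
    """Weighted model aggregation as in Eq. (1): w = sum_i p_i * w_i,
    with p_i = |D_i| / |D| proportional to each client's dataset size."""
    sizes = np.asarray(dataset_sizes, dtype=float)
    weights = sizes / sizes.sum()                  # p_i, summing to 1
    stacked = np.stack(local_params, axis=0)       # shape: (N, d)
    return np.tensordot(weights, stacked, axes=1)  # aggregated parameter vector w

# Toy usage: three clients with different local dataset sizes.
w_locals = [np.array([1.0, 2.0]), np.array([0.0, 1.0]), np.array([2.0, 2.0])]
w_global = aggregate(w_locals, dataset_sizes=[100, 200, 100])
```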

Fig. 1: A FL training model with hidden adversaries who can eavesdrop trained parameters from both the clients and the server.

B. Threat Model

The server in this paper is assumed to be honest. However, there are external adversaries targeting clients' private information. Although the individual dataset D_i of the i-th client is kept locally in FL, the intermediate parameter w_i needs to be shared with the server, which may reveal the clients' private information, as demonstrated by model inversion attacks. For example, the authors in [37] demonstrated a model-inversion attack that recovers images from a facial recognition system. In addition, privacy leakage can also happen in the broadcasting (through downlink channels) phase by analyzing the global parameter w.

We also assume that uplink channels are more secure than downlink broadcasting channels, since clients can be assigned to different channels (e.g., time slots, frequency bands) dynamically at each uploading time, while downlink channels are broadcast. Hence, we assume that there are at most L (L ≤ T) exposures of uploaded parameters from each client in the uplink¹ and T exposures of aggregated parameters in the downlink, where T is the number of aggregation times.

¹Here we assume that the adversary cannot know where the parameters come from.

C. Differential Privacy

(ǫ, δ)-DP provides a strong criterion for the privacy preservation of distributed data processing systems. Here, ǫ > 0 is the distinguishable bound of all outputs on neighboring datasets D_i, D_i' in a database, and δ represents the probability that the ratio of the probabilities of two adjacent datasets D_i, D_i' cannot be bounded by e^ǫ after adding a privacy-preserving mechanism. With an arbitrarily given δ, a privacy-preserving mechanism with a larger ǫ gives a clearer distinguishability of neighboring datasets and hence a higher risk of privacy violation. Now, we will formally define DP as follows.
Definition 1 ((ǫ, δ)-DP [25]): A randomized mechanism M: X → R with domain X and range R satisfies (ǫ, δ)-DP, if for all measurable sets S ⊆ R and for any two adjacent databases D_i, D_i' ∈ X,

    \Pr[\mathcal{M}(D_i) \in S] \le e^{\epsilon} \Pr[\mathcal{M}(D_i') \in S] + \delta.    (3)

For numerical data, a Gaussian mechanism defined in [25] can be used to guarantee (ǫ, δ)-DP. According to [25], we present the following DP mechanism, which adds artificial Gaussian noise.

In order to ensure that a given noise distribution n ~ N(0, σ²) preserves (ǫ, δ)-DP, where N represents the Gaussian distribution, we choose the noise scale σ ≥ c∆s/ǫ and the constant c ≥ \sqrt{2\ln(1.25/\delta)} for ǫ ∈ (0, 1). In this result, n is the value of an additive noise sample for a data item in the dataset, ∆s is the sensitivity of the function s given by ∆s = \max_{D_i, D_i'} \|s(D_i) - s(D_i')\|, and s is a real-valued function.

Considering the above DP mechanism, choosing an appropriate level of noise remains a significant research problem, which will affect the privacy guarantee of the clients and the convergence rate of the FL process.
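As an illustration of the Gaussian mechanism described above, the short sketch below calibrates the noise scale from (ǫ, δ) and a given sensitivity ∆s and perturbs a query output. It is not from the paper; the helper names (gaussian_noise_scale, perturb) are hypothetical, and it simply instantiates σ = c∆s/ǫ with c = \sqrt{2\ln(1.25/\delta)} for ǫ ∈ (0, 1).

```python
import numpy as np

def gaussian_noise_scale(eps, delta, sensitivity):
    """Standard deviation of the Gaussian mechanism: sigma = c * delta_s / eps,
    with c = sqrt(2 * ln(1.25 / delta)), stated for eps in (0, 1)."""
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    return c * sensitivity / eps

def perturb(value, eps, delta, sensitivity, rng=np.random.default_rng()):
    """Add zero-mean Gaussian noise calibrated to (eps, delta)-DP."""
    sigma = gaussian_noise_scale(eps, delta, sensitivity)
    return value + rng.normal(0.0, sigma, size=np.shape(value))

# Example: perturb a scalar query with sensitivity 1 under (0.5, 1e-2)-DP.
noisy = perturb(3.7, eps=0.5, delta=1e-2, sensitivity=1.0)
```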
III. FEDERATED LEARNING WITH DIFFERENTIAL PRIVACY

In this section, we first introduce the concept of global DP and analyze the DP performance in the context of FL. Then we propose the NbAFL scheme that can satisfy the DP requirement by adding proper noise perturbations at both the clients and the server.

A. Global Differential Privacy

Here, we define a global (ǫ, δ)-DP requirement for both uplink and downlink channels. From the uplink perspective, using a clipping technique, we can ensure that ||w_i|| ≤ C, where w_i denotes the training parameters from the i-th client without perturbation and C is a clipping threshold for bounding w_i. We assume that the batch size in the local training is equal to the number of training samples and then define the local training process at the i-th client by

    s_U^{D_i} \triangleq w_i = \arg\min_{w} F_i(w, D_i) = \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \arg\min_{w} F_i(w, D_{i,j}),    (4)

where D_i is the i-th client's database and D_{i,j} is the j-th sample in D_i. Thus, the sensitivity of s_U^{D_i} can be expressed as

    \Delta s_U^{D_i} = \max_{D_i, D_i'} \| s_U^{D_i} - s_U^{D_i'} \|
    = \max_{D_i, D_i'} \left\| \frac{1}{|D_i|} \sum_{j=1}^{|D_i|} \arg\min_{w} F_i(w, D_{i,j}) - \frac{1}{|D_i'|} \sum_{j=1}^{|D_i'|} \arg\min_{w} F_i(w, D_{i,j}') \right\| = \frac{2C}{|D_i|},    (5)

where D_i' is an adjacent dataset to D_i which has the same size but differs by only one sample, and D_{i,j}' is the j-th sample in D_i'. From the above result, a global sensitivity in the uplink channel can be defined by

    \Delta s_U \triangleq \max \left\{ \Delta s_U^{D_i} \right\}, \quad \forall i.    (6)

To achieve a small global sensitivity, the ideal condition is that all the clients use sufficiently large local datasets for training. Hence, we define the minimum size of the local datasets by m and then obtain ∆s_U = 2C/m. To ensure (ǫ, δ)-DP for each client in the uplink in one exposure, we set the noise scale, represented by the standard deviation of the additive Gaussian noise, as σ_U = c∆s_U/ǫ. Considering L exposures of local parameters, we need to set σ_U = cL∆s_U/ǫ due to the linear relation between ǫ and σ_U in the Gaussian mechanism.

From the downlink perspective, the aggregation operation for D_i can be expressed as

    s_D^{D_i} \triangleq w = p_1 w_1 + \ldots + p_i w_i + \ldots + p_N w_N,    (7)

where 1 ≤ i ≤ N and w is the aggregated parameter at the server to be broadcast to the clients. Regarding the sensitivity of s_D^{D_i}, i.e., ∆s_D^{D_i}, we have the following lemma.

Lemma 1 (Sensitivity for the aggregation operation): In the FL training process, the sensitivity for D_i after the aggregation operation s_D^{D_i} is given by

    \Delta s_D^{D_i} = \frac{2Cp_i}{m}.    (8)

Proof: See Appendix A.

Remark 1: From the above lemma, to achieve a small global sensitivity in the downlink channel, which is defined by

    \Delta s_D \triangleq \max \left\{ \Delta s_D^{D_i} \right\} = \max \left\{ \frac{2Cp_i}{m} \right\}, \quad \forall i,    (9)

the ideal condition is that all the clients should use the same size of local datasets for training, i.e., p_i = 1/N.

From the above remark, when setting p_i = 1/N, ∀i, we can obtain the optimal value of the sensitivity ∆s_D. So here we should add noise at the client side first and then decide whether or not to add noise at the server to satisfy the (ǫ, δ)-DP criterion in the downlink channel.
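The following sketch, which is not part of the original paper, illustrates the client-side steps implied by the derivation above: clip the trained parameter vector to norm C so that the uplink sensitivity is ∆s_U = 2C/m, and then calibrate the uplink noise standard deviation as σ_U = cL∆s_U/ǫ (Theorem 1 below gives the extra noise, if any, that the server adds on the downlink). The function names and example values are assumptions made for illustration.

```python
import numpy as np

def clip_by_norm(w, clip_c):
    """Parameter clipping as in Algorithm 1: w <- w / max(1, ||w|| / C)."""
    return w / max(1.0, np.linalg.norm(w) / clip_c)

def uplink_noise_std(eps, delta, clip_c, m, exposures_l):
    """sigma_U = c * L * delta_s_U / eps with delta_s_U = 2C/m and
    c = sqrt(2 * ln(1.25 / delta))."""
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    delta_s_u = 2.0 * clip_c / m
    return c * exposures_l * delta_s_u / eps

# Example: a client clips its update and perturbs it before uploading.
rng = np.random.default_rng(0)
w_local = clip_by_norm(rng.normal(size=10), clip_c=20.0)
sigma_u = uplink_noise_std(eps=60.0, delta=0.01, clip_c=20.0, m=100, exposures_l=25)
w_noisy = w_local + rng.normal(0.0, sigma_u, size=w_local.shape)
```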
Theorem 1 (DP guarantee for downlink channels): To ensure (ǫ, δ)-DP in the downlink channels with T aggregations, the standard deviation of the Gaussian noise n_D that is added to the aggregated parameter w by the server can be given as

    \sigma_D =
    \begin{cases}
        \dfrac{2cC\sqrt{T^2 - L^2 N}}{mN\epsilon}, & T > L\sqrt{N}, \\
        0, & T \le L\sqrt{N}.
    \end{cases}    (10)

Proof: See Appendix B.

Theorem 1 shows that to satisfy an (ǫ, δ)-DP requirement for the downlink channels, additional noise n_D needs to be added by the server. With a certain L, the standard deviation of the additional noise depends on the relationship between the number of aggregation times T and the number of clients N. The intuition is that a larger T can lead to a higher chance of information leakage, while a larger number of clients is helpful for hiding their private information. This theorem also provides the variance of the noise that should be added to the aggregated parameters. Based on the above results, we propose the following NbAFL algorithm.

B. Proposed NbAFL

Algorithm 1 outlines our NbAFL for training an effective model with a global (ǫ, δ)-DP requirement. We denote by µ the preset constant of the proximal term and by w^(0) the initial global parameter. At the beginning of this algorithm, the required privacy level parameters (ǫ, δ) are set and the initial global parameter w^(0) is sent by the server to the clients. In the t-th aggregation, the N active clients respectively train the parameters by using their local databases with preset termination conditions. After completing the local training, the i-th client, ∀i, will add noise to the trained parameters w_i^(t), and upload the noised parameters w̃_i^(t) to the server for aggregation.

Then the server updates the global parameters w^(t) by aggregating the local parameters integrated with different weights. Additive noise n_D^(t) is added to this w^(t) according to Theorem 1 before it is broadcast to the clients. Based on the received global parameters w̃^(t), each client will estimate the accuracy by using its local testing database and start the next round of the training process based on these received parameters. The FL process completes after the aggregation time reaches a preset number T, and the algorithm returns w̃^(T).

Algorithm 1: Noising before Aggregation FL
Data: T, w^(0), µ, ǫ and δ
1   Initialization: t = 1 and w_i^(0) = w^(0), ∀i
2   while t ≤ T do
3       Local training process:
4       while C_i ∈ {C_1, C_2, ..., C_N} do
5           Update the local parameters w_i^(t) as
6               w_i^(t) = \arg\min_{w_i} \left( F_i(w_i) + \frac{\mu}{2} \|w_i - w^{(t-1)}\|^2 \right)
7           Clip the local parameters
                w_i^(t) = w_i^(t) / \max\left(1, \frac{\|w_i^{(t)}\|}{C}\right)
8           Add noise and upload parameters
                w̃_i^(t) = w_i^(t) + n_i^(t)
9       Model aggregating process:
10      Update the global parameters w^(t) as
11          w^(t) = \sum_{i=1}^{N} p_i w̃_i^(t)
12      The server broadcasts global noised parameters
13          w̃^(t) = w^(t) + n_D^(t)
14      Local testing process:
15      while C_i ∈ {C_1, C_2, ..., C_N} do
16          Test the aggregated parameters w̃^(t) using the local dataset
17      t ← t + 1
Result: w̃^(T)

Now, let us focus on the privacy preservation performance of the NbAFL. First, the set of all local parameters is received by the server. Owing to the local perturbations in the NbAFL, it will be difficult for malicious adversaries to infer the information of the i-th client from its uploaded parameters w̃_i. After the model aggregation, the aggregated parameters w will be sent back to the clients via broadcast channels. This poses threats to clients' privacy, as potential adversaries may reveal sensitive information about individual clients from w. In this case, additive noise may be imposed on w based on Theorem 1.
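To make the flow of Algorithm 1 concrete, here is a compact single-round sketch under stated assumptions; it is an illustrative approximation, not the paper's reference implementation. The callable local_solve stands in for the proximal local update in line 6 of Algorithm 1, sigma_u and sigma_d are assumed to be pre-computed from the uplink rule σ_U = cL∆s_U/ǫ and from Theorem 1, and all names are hypothetical.

```python
import numpy as np

def nbafl_round(w_global, clients, p, clip_c, sigma_u, sigma_d, local_solve,
                rng=np.random.default_rng()):
    """One NbAFL aggregation: local update, clip, client-side noise,
    weighted aggregation, then server-side noise before broadcast."""
    noisy_updates = []
    for data in clients:
        w_i = local_solve(w_global, data)                       # line 6 (local update)
        w_i = w_i / max(1.0, np.linalg.norm(w_i) / clip_c)      # line 7 (clipping)
        noisy_updates.append(w_i + rng.normal(0.0, sigma_u, w_i.shape))  # line 8
    w_agg = sum(pi * wi for pi, wi in zip(p, noisy_updates))    # line 11
    return w_agg + rng.normal(0.0, sigma_d, w_agg.shape)        # line 13 (broadcast copy)

# Toy usage with a dummy local solver that just nudges the global model.
local_solve = lambda w, data: w + 0.1 * (data - w)
clients = [np.ones(4) * k for k in range(1, 6)]
p = [1.0 / len(clients)] * len(clients)
w_next = nbafl_round(np.zeros(4), clients, p, clip_c=20.0,
                     sigma_u=0.1, sigma_d=0.05, local_solve=local_solve)
```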
IV. CONVERGENCE ANALYSIS ON NBAFL

In this section, we are ready to analyze the convergence performance of the proposed NbAFL. First, we analyze the expected increment over adjacent aggregations in the loss function with Gaussian noise. Then, we focus on deriving the convergence property under the global (ǫ, δ)-DP requirement. For the convenience of the analysis, we make the following assumptions on the loss function and the network parameters.

Assumption 1: We make the following assumptions on the global loss function F(·), defined by F(·) ≜ \sum_{i=1}^{N} p_i F_i(·), and on the i-th local loss function F_i(·):
1) F_i(w) is convex;
2) F_i(w) satisfies the Polyak–Lojasiewicz condition with the positive parameter l, which implies that F(w) − F(w*) ≤ \frac{1}{2l}\|\nabla F(w)\|^2, where w* is the optimal result;
3) F(w^(0)) − F(w*) = Θ;
4) F_i(w) is β-Lipschitz, i.e., \|F_i(w) − F_i(w')\| ≤ β\|w − w'\|, for any w, w';
5) F_i(w) is ρ-Lipschitz smooth, i.e., \|\nabla F_i(w) − \nabla F_i(w')\| ≤ ρ\|w − w'\|, for any w, w', where ρ is a constant determined by the practical loss function;
6) For any i and w, \|\nabla F_i(w) − \nabla F(w)\| ≤ ε_i, where ε_i is the divergence metric.

Similar to the gradient divergence, the divergence metric ε_i captures the divergence between the gradients of the local loss functions and that of the aggregated loss function, which is essential for analyzing SGD. The divergence is related to how the data is distributed across the different nodes. Using Assumption 1 and assuming \|\nabla F(w)\| to be uniformly bounded away from zero, we then have the following lemma.

Lemma 2 (B-dissimilarity of various clients): For a given ML parameter w, there exists B satisfying

    \mathbb{E}\{\|\nabla F_i(w)\|^2\} \le \|\nabla F(w)\|^2 B^2, \quad \forall i.    (11)

Proof: See Appendix C.

Lemma 2 follows from the assumption on the divergence metric and captures the statistical heterogeneity of all clients. As mentioned earlier, the values of ρ and B are determined by the specific global loss function F(w) in practice and the training parameters w. With the above preparation, we are now ready to analyze the convergence property of NbAFL. First, we present the following lemma to derive an expected increment bound on the loss function during each iteration of the parameters with artificial noise.

Lemma 3 (Expected increment in the loss function): After receiving updates, from the t-th to the (t + 1)-th aggregation, the expected difference in the loss function can be upper-bounded by

    \mathbb{E}\{F(\tilde{w}^{(t+1)}) - F(\tilde{w}^{(t)})\} \le \lambda_2 \mathbb{E}\{\|\nabla F(\tilde{w}^{(t)})\|^2\}
    + \lambda_1 \mathbb{E}\{\|n^{(t+1)}\| \, \|\nabla F(\tilde{w}^{(t)})\|\} + \lambda_0 \mathbb{E}\{\|n^{(t+1)}\|^2\},    (12)

where \lambda_0 = \frac{\rho}{2}, \lambda_1 = \frac{1}{\mu} + \frac{\rho B}{\mu}, \lambda_2 = -\frac{1}{\mu} + \frac{\rho B}{\mu^2} + \frac{\rho B^2}{2\mu^2}, and n^{(t)} is the equivalent noise imposed on the parameters after the t-th aggregation, given by n^{(t)} = \sum_{i=1}^{N} p_i n_i^{(t)} + n_D^{(t)}.

Proof: See Appendix D.

In this lemma, the value of an additive noise sample n in the vector n^(t) satisfies the Gaussian distribution n ~ N(0, σ_A²). Also, we can obtain σ_A = \sqrt{\sigma_D^2 + \sigma_U^2/N} from Section III. From the right hand side (RHS) of the above inequality, we can see that it is crucial to select a proper proximal term µ to achieve a low upper bound. It is clear that artificial noise with a large σ_A may improve the DP performance in terms of privacy protection. However, from the RHS of (12), a large σ_A may enlarge the expected difference of the loss function between two consecutive aggregations, leading to a deterioration of convergence performance.

Furthermore, to satisfy the global (ǫ, δ)-DP, by using Theorem 1, we have

    \sigma_A =
    \begin{cases}
        \dfrac{cT\Delta s_D}{\epsilon}, & T > L\sqrt{N}, \\
        \dfrac{cL\Delta s_U}{\sqrt{N}\epsilon}, & T \le L\sqrt{N}.
    \end{cases}    (13)

Next, we will analyze the convergence property of NbAFL with the (ǫ, δ)-DP requirement.

Theorem 2 (Convergence upper bound of the NbAFL): With the required protection level ǫ, the convergence upper bound of Algorithm 1 after T aggregations is given by

    \mathbb{E}\{F(\tilde{w}^{(T)}) - F(w^*)\} \le P^T \Theta + \left( \frac{\kappa_1 T}{\epsilon} + \frac{\kappa_0 T^2}{\epsilon^2} \right) \left( 1 - P^T \right),    (14)

where P = 1 + 2l\lambda_2, \kappa_1 = \frac{\lambda_1 \beta c C}{m(1-P)} \sqrt{\frac{2}{N\pi}} and \kappa_0 = \frac{\lambda_0 c^2 C^2}{m^2 (1-P) N}.

Proof: See Appendix E.

Theorem 2 reveals an important relationship between privacy and utility by taking into account the protection level ǫ and the number of aggregation times T. As the number of aggregation times T increases, the first term of the upper bound decreases but the second term increases. Furthermore, by viewing T as a continuous variable and by writing the RHS of (14) as h(T), we have

    \frac{d^2 h(T)}{dT^2} = \left( \Theta - \frac{\kappa_1 T}{\epsilon} - \frac{\kappa_0 T^2}{\epsilon^2} \right) P^T \ln^2 P
    - 2\left( \frac{\kappa_1}{\epsilon} + \frac{2\kappa_0 T}{\epsilon^2} \right) P^T \ln P + \frac{2\kappa_0}{\epsilon^2} \left( 1 - P^T \right).    (15)

It can be seen that the second term and the third term on the RHS of (15) are always positive. When N and ǫ are set to be large enough, we can see that κ_1 and κ_0 are small, and thus the first term can also be positive. In this case, we have d²h(T)/dT² > 0 and the upper bound is convex in T.

Remark 2: As can be seen from this theorem, the expected gap between the achieved loss function F(w̃^(T)) and the minimum one F(w*) is a decreasing function of ǫ. By increasing ǫ, i.e., relaxing the privacy protection level, the performance of the NbAFL algorithm will improve. This is reasonable because the variance of the artificial noise decreases, thereby improving the convergence performance.

Remark 3: The number of clients N will also affect the iterative convergence performance, i.e., a larger N would achieve a better convergence performance. This is because a larger N leads to a lower variance of the artificial noise.

Remark 4: There is an optimal number of maximum aggregation times T in terms of convergence performance for given ǫ and N. In more detail, a larger T may lead to a higher variance of the artificial noise, and thus pose a negative impact on the convergence performance. On the other hand, more iterations can generally boost the convergence performance if the noise is not too large. In this sense, there is a tradeoff in choosing a proper T.
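Remark 4 suggests scanning the bound in (14) over T to locate the best number of aggregation times. The sketch below, which is not from the paper, evaluates h(T) = P^T Θ + (κ_1 T/ǫ + κ_0 T²/ǫ²)(1 − P^T) on a grid and returns the minimizer; the numerical values of Θ, P, κ_1 and κ_0 are placeholders that would in practice be estimated from the loss function as discussed above.

```python
import numpy as np

def h_bound(T, theta, P, kappa1, kappa0, eps):
    """RHS of Eq. (14): P^T * Theta + (kappa1*T/eps + kappa0*T^2/eps^2) * (1 - P^T)."""
    return P**T * theta + (kappa1 * T / eps + kappa0 * T**2 / eps**2) * (1.0 - P**T)

def best_T(theta, P, kappa1, kappa0, eps, T_max=100):
    """Grid search for the aggregation number minimizing the bound."""
    Ts = np.arange(1, T_max + 1)
    values = h_bound(Ts, theta, P, kappa1, kappa0, eps)
    return int(Ts[np.argmin(values)]), float(values.min())

# Placeholder constants (illustrative only): the first term decays with T while
# the noise term grows, so a coarse grid search recovers the optimal T.
T_opt, h_min = best_T(theta=2.0, P=0.8, kappa1=5.0, kappa0=2.0, eps=60.0)
```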
V. K-CLIENT RANDOM SCHEDULING POLICY

In this section, we consider the case where only K (K < N) clients are selected to participate in the aggregation process, namely K-client random scheduling.

We now discuss how to add artificial noise in the K-client random scheduling to satisfy a global (ǫ, δ)-DP. It is obvious that in the uplink channels, each of the K scheduled clients should add noise with scale σ_U = cL∆s_U/ǫ to achieve (ǫ, δ)-DP. This is equivalent to the noise scale in the all-client selection case in Section III, since each client only considers its own privacy for the uplink channels in both cases. However, the derivation of the noise scale in the downlink will be different for the K-client random scheduling. As an extension of Theorem 1, we present the following lemma on how to obtain σ_D in the case of K-client random scheduling.
Lemma 4 (DP guarantee for K-client scheduling): In the NbAFL algorithm with K-client random scheduling, to satisfy a global (ǫ, δ)-DP, the standard deviation σ_D of the additive Gaussian noise for the downlink channels should be set as

    \sigma_D =
    \begin{cases}
        \dfrac{2cC\sqrt{\frac{T^2}{b^2} - L^2 K}}{mK\epsilon}, & T > \gamma\epsilon, \\
        0, & T \le \gamma\epsilon,
    \end{cases}    (16)

where b = -\frac{T}{\epsilon}\ln\!\left(1 - \frac{N}{K} + \frac{N}{K}e^{-\frac{\epsilon}{T}}\right) and \gamma = \left[-\ln\!\left(1 - \frac{K}{N} + \frac{K}{N}e^{-\frac{\epsilon}{L\sqrt{K}}}\right)\right]^{-1}.

Proof: See Appendix F.
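A small numeric sketch of Lemma 4 follows (not from the paper; names and example values are illustrative): given (ǫ, δ), L, the clipping threshold C, the minimum local dataset size m, and the schedule size K, it evaluates b and γ as reconstructed above and returns the downlink noise standard deviation σ_D from (16).

```python
import numpy as np

def downlink_std_k_clients(eps, delta, T, L, K, N, clip_c, m):
    """sigma_D for K-client random scheduling, following Eq. (16)."""
    c = np.sqrt(2.0 * np.log(1.25 / delta))
    q = K / N
    gamma = -1.0 / np.log(1.0 - q + q * np.exp(-eps / (L * np.sqrt(K))))
    if T <= gamma * eps:
        return 0.0  # client-side noise alone already meets the downlink requirement
    b = -(T / eps) * np.log(1.0 - 1.0 / q + np.exp(-eps / T) / q)
    return 2.0 * c * clip_c * np.sqrt(T**2 / b**2 - L**2 * K) / (m * K * eps)

# Example with the simulation-style settings used later (N = 50, K = 20, T = 25):
# here T <= gamma * eps, so no extra server-side noise is required.
sigma_d = downlink_std_k_clients(eps=60.0, delta=0.01, T=25, L=25, K=20, N=50,
                                 clip_c=20.0, m=100)
```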
Lemma 4 recalculates σ_D by considering the number of chosen clients K. Generally, the number of clients N is fixed, so we focus on the effect of K. Based on the DP analysis in Lemma 4, we can obtain the following theorem.

Theorem 3 (Convergence under K-client scheduling): With the required protection level ǫ and the number of chosen clients K, for any Θ > 0, the convergence upper bound after T aggregation times is given by

    \mathbb{E}\{F(\tilde{v}^{(T)}) - F(w^*)\} \le Q^T \Theta
    + \frac{1 - Q^T}{1 - Q} \left( \frac{cC\alpha_1 \beta}{-mK\ln\!\left(1 - \frac{K}{N} + \frac{K}{N}e^{-\frac{\epsilon}{T}}\right)} \sqrt{\frac{2}{\pi}}
    + \frac{c^2 C^2 \alpha_0}{m^2 K^2 \ln^2\!\left(1 - \frac{K}{N} + \frac{K}{N}e^{-\frac{\epsilon}{T}}\right)} \right),    (17)

where Q = 1 + \frac{2l}{\mu^2}\left( \frac{\rho B^2}{2} + \rho B + \frac{\rho B^2}{K} + \frac{2\rho B^2}{\sqrt{K}} + \frac{\mu B}{\sqrt{K}} - \mu \right), \alpha_0 = \frac{2\rho K}{N} + \rho, \alpha_1 = 1 + \frac{2\rho B}{\mu} + \frac{2\rho B\sqrt{K}}{\mu N}, and \tilde{v}^{(T)} = \sum_{i=1}^{K} p_i \left( w_i^{(T)} + n_i^{(T)} \right) + n_D^{(T)}.

Proof: See Appendix G.

The above theorem provides the convergence upper bound between F(ṽ^(T)) and F(w*) under K-client random scheduling. Using K-client random scheduling, we can obtain an important relationship between privacy and utility by taking into account the protection level ǫ, the number of aggregation times T and the number of chosen clients K.

Remark 5: From the bound derived in Theorem 3, we conclude that there is an optimal K between 0 and N that achieves the optimal convergence performance. That is, by finding a proper K, the K-client random scheduling policy is superior to the one in which all N clients participate in the FL aggregations.
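Following Remark 5, the bound in (17) can be scanned over K to locate the best schedule size. The sketch below is illustrative only and is not the authors' code; it treats l, µ, ρ, B, β, Θ, C, c and m as constants that would be estimated for the loss function at hand, and simply evaluates the right-hand side of (17) for each K.

```python
import numpy as np

def theorem3_bound(K, N, T, eps, theta, l, mu, rho, B, beta, c, C, m):
    """Evaluate the RHS of Eq. (17) for a given number of scheduled clients K."""
    alpha2 = (rho * B**2 / 2 + rho * B + rho * B**2 / K
              + 2 * rho * B**2 / np.sqrt(K) + mu * B / np.sqrt(K) - mu) / mu**2
    Q = 1 + 2 * l * alpha2
    alpha0 = 2 * rho * K / N + rho
    alpha1 = 1 + 2 * rho * B / mu + 2 * rho * B * np.sqrt(K) / (mu * N)
    log_term = np.log(1 - K / N + (K / N) * np.exp(-eps / T))
    noise = (c * C * alpha1 * beta * np.sqrt(2 / np.pi) / (-m * K * log_term)
             + c**2 * C**2 * alpha0 / (m**2 * K**2 * log_term**2))
    return Q**T * theta + (1 - Q**T) / (1 - Q) * noise

# Illustrative constants only; in practice they are estimated as in Section VI.
params = dict(N=50, T=25, eps=60.0, theta=2.0, l=0.1, mu=5.0, rho=0.1,
              B=1.0, beta=1.0, c=3.0, C=20.0, m=100)
Ks = np.arange(1, 50)
K_best = int(Ks[np.argmin([theorem3_bound(K, **params) for K in Ks])])
```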
VI. SIMULATION RESULTS

In this section, we evaluate the proposed NbAFL by using a multi-layer perceptron (MLP) and real-world federated datasets. In order to characterize the convergence property of NbAFL, we conduct experiments by varying the protection level ǫ, the number of clients N, the number of maximum aggregation times T and the number of chosen clients K.

We conduct experiments on the standard MNIST dataset for handwritten digit recognition, consisting of 60000 training examples and 10000 testing examples [38]. Each example is a 28 × 28 gray-level image. Our baseline model uses an MLP network with a single hidden layer containing 256 hidden units. In this feed-forward neural network, we use ReLU units and a softmax over 10 classes (corresponding to the 10 digits). For the network optimizer, we set the learning rate to 0.002. We then evaluate this MLP on the multi-class classification task with the standard MNIST dataset, namely, recognizing digits from 0 to 9, where each client has 100 training samples locally. This setting is in line with the ideal condition in Remark 1.

We can note that parameter clipping C is a popular ingredient of SGD and ML for non-privacy reasons. A proper value of the clipping threshold C should be considered for the DP-based FL framework. In the following experiments (except Subsection B), we utilize the method in [29] and choose C by taking the median of the norms of the unclipped parameters over the course of training. The values of ρ, β, l and B are determined by the specific loss function, and we will use estimated values in our simulations [20].
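The clipping-threshold heuristic described above (taking C as the median of the unclipped parameter norms observed during training) can be sketched as follows. This is an illustration of that one sentence, not the authors' implementation; the running list of norms and the function name are assumptions.

```python
import numpy as np

def median_clipping_threshold(norm_history):
    """Choose C as the median of the unclipped parameter norms seen so far."""
    return float(np.median(norm_history))

# During training, each round appends the norm of the unclipped local update,
# and the clipping threshold is refreshed from the running history.
rng = np.random.default_rng(1)
norm_history = []
for scale in (5, 10, 20, 40):
    w_unclipped = rng.normal(size=10) * scale
    norm_history.append(np.linalg.norm(w_unclipped))
    C = median_clipping_threshold(norm_history)
```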
A. Performance Evaluation on Protection Levels

In Fig. 2, we choose various protection levels ǫ = 50, ǫ = 60 and ǫ = 100 to show the results of the loss function in NbAFL. Furthermore, we also include a non-private approach for comparison with our NbAFL. In this experiment, we set N = 50, T = 25 and δ = 0.01, and compute the values of the loss function as a function of the aggregation time t. As shown in Fig. 2, the values of the loss function in NbAFL decrease as we relax the privacy guarantees (increasing ǫ). Such observations are in line with Remark 2. We also choose high protection levels ǫ = 6, ǫ = 8 and ǫ = 10 for this experiment, where each client has 512 training samples locally. We set N = 50, T = 25 and δ = 0.01. From Fig. 3, we can draw a similar conclusion as in Remark 2, namely that the values of the loss function in NbAFL decrease as we relax the privacy guarantees.

Fig. 2: The comparison of training loss with various protection levels for 50 clients using ǫ = 50, ǫ = 60 and ǫ = 100, respectively.

Fig. 3: The comparison of training loss with various protection levels for 50 clients using ǫ = 6, ǫ = 8 and ǫ = 10, respectively.

Considering the K-client random scheduling, in Fig. 4, we investigate the performance with various protection levels ǫ = 50, ǫ = 60 and ǫ = 100. For the simulation parameters, we set N = 50, K = 20, T = 25, and δ = 0.01. As shown in Fig. 4, the convergence performance under the K-client random scheduling is improved with an increasing ǫ.

Fig. 4: The comparison of training loss with various privacy levels for 50 clients using ǫ = 50, ǫ = 60 and ǫ = 100, respectively.

B. Impact of the Clipping Threshold C

In Fig. 5, we choose various clipping thresholds C = 10, 15, 20 and 25 to show the results of the loss function for 50 clients using ǫ = 60 in NbAFL. As shown in Fig. 5, when C = 20, the convergence performance of NbAFL attains the best value. We can note that limiting the parameter norm has two opposing effects. On the one hand, if the clipping threshold C is too small, clipping destroys the intended gradient direction of the parameters. On the other hand, increasing the norm bound C forces us to add more noise to the parameters because of its effect on the sensitivity.

Fig. 5: The comparison of training loss with various clipping thresholds for 50 clients using ǫ = 60.

C. Impact of the Number of Clients N

Fig. 6 compares the convergence performance of NbAFL under the required protection level ǫ = 60 and δ = 10⁻² as a function of the number of clients, N. In this experiment, we set N = 50, N = 60, N = 80 and N = 100. We notice that the behavior of the performance across different numbers of clients is governed by Remark 3. This is because more clients not only provide larger global datasets for training, but also bring down the standard deviation of the additive noise due to the aggregation.

Fig. 6: The value of the loss function with various numbers of clients under ǫ = 60 under the NbAFL algorithm with 50 clients.

D. Impact of the Number of Maximum Aggregation Times T

In Fig. 7, we show the experimental results of the training loss as a function of the number of maximum aggregation times with various privacy levels ǫ = 50, 60, 80 and 100 under the NbAFL algorithm. This observation is in line with Remark 4, and the reason comes from the fact that a lower privacy level decreases the standard deviation of the additive noise and the server can obtain better-quality ML model parameters from the clients. Fig. 7 also implies that the optimal number of maximum aggregation times increases almost monotonically with increasing ǫ.
In Fig. 9, we plot the values of the loss function in the normalized NbAFL using solid lines and the K-client random scheduling based NbAFL using dotted lines, for various numbers of maximum aggregation times. This figure shows that the value of the loss function is a convex function of the number of maximum aggregation times for a given protection level under the NbAFL algorithm, which validates Remark 4. From Fig. 9, we can also see that for a given ǫ, the K-client random scheduling based NbAFL algorithm has a better convergence performance than the normalized NbAFL algorithm for a larger T. This is because K-client random scheduling can bring down the variance of the artificial noise with little performance loss.

Fig. 7: The convergence upper bounds with various privacy levels ǫ = 50, 60 and 100 under the 50-client NbAFL algorithm.

Fig. 8: The comparison of the loss function between experimental and theoretical results with various aggregation times under the NbAFL algorithm with 50 clients.

Fig. 9: The value of the loss function with various privacy levels ǫ = 60 and ǫ = 100 under the NbAFL algorithm with 50 clients.

E. Impact of the Number of Chosen Clients K

In Fig. 10, we plot the values of the loss function with various numbers of chosen clients K under the random scheduling policy in NbAFL. The number of clients is N = 50, and K clients are randomly chosen to participate in training and aggregation in each iteration. In this experiment, we set ǫ = 50, ǫ = 60, ǫ = 100 and δ = 0.01. Meanwhile, we also exhibit the performance of the non-private approach with various numbers of chosen clients K. Note that an optimal K which further improves the convergence performance exists for the various protection levels, due to a trade-off between enhancing privacy protection and involving larger global training datasets in each model updating round. This observation is in line with Remark 5. The figure shows that in NbAFL, for a given protection level ǫ, the K-client random scheduling can obtain a better tradeoff than the normal selection policy.

Fig. 10: The value of the loss function with various numbers of chosen clients under ǫ = 50, 60, 100 under the NbAFL algorithm and the non-private approach with 50 clients.

VII. CONCLUSIONS

In this paper, we have focused on information leakage in SGD-based FL. We have first defined a global (ǫ, δ)-DP requirement for both uplink and downlink channels, and developed the variances of the artificial noise at the client and server sides. Then, we have proposed a novel framework based on the concept of global (ǫ, δ)-DP, named NbAFL. We have theoretically developed a convergence bound on the loss function of the trained FL model in the NbAFL. From the theoretical convergence bounds, we have obtained the following results: 1) There is a tradeoff between the convergence performance and privacy protection levels, i.e., better convergence performance leads to a lower protection level; 2) Increasing the number N of overall clients participating in FL can improve
the convergence performance, given a fixed privacy protection level; and 3) There is an optimal number of maximum aggregation times in terms of convergence performance for a given protection level. Furthermore, we have proposed a K-client random scheduling strategy and also developed a corresponding convergence bound on the loss function in this case. In addition to the above three properties, we find that there exists an optimal value of K that achieves the best convergence performance at a fixed privacy level. Extensive simulation results confirm the correctness of our analysis. Therefore, our analytical results are helpful for the design of privacy-preserving FL architectures with different tradeoff requirements on convergence performance and privacy levels.

We can note that the size and the distribution of data both greatly affect the quality of FL training. As future work, it is of great interest to analytically evaluate the convergence performance of NbAFL with varying sizes and distributions of data at the client sides.

APPENDIX A
PROOF OF LEMMA 1

From the downlink perspective, for all D_i and D_i' which differ in a single entry, the sensitivity can be expressed as

    \Delta s_D^{D_i} = \max_{D_i, D_i'} \| s_D^{D_i} - s_D^{D_i'} \|.    (18)

Based on (4) and (7), we have

    s_D^{D_i} = p_1 w_1(D_1) + \ldots + p_i w_i(D_i) + \ldots + p_N w_N(D_N)    (19)

and

    s_D^{D_i'} = p_1 w_1(D_1) + \ldots + p_i w_i(D_i') + \ldots + p_N w_N(D_N).    (20)

Furthermore, the sensitivity can be given as

    \Delta s_D^{D_i} = \max_{D_i, D_i'} \| p_i w_i(D_i) - p_i w_i(D_i') \|
    = p_i \max_{D_i, D_i'} \| w_i(D_i) - w_i(D_i') \| = p_i \Delta s_U^{D_i} \le \frac{2Cp_i}{m}.    (21)

Hence, we know \Delta s_D^{D_i} = \frac{2Cp_i}{m}. This completes the proof.

APPENDIX B
PROOF OF THEOREM 1

To ensure a global (ǫ, δ)-DP in the uplink channels, the standard deviation of the additive noise at the client sides can be set to σ_U = cL∆s_U/ǫ due to the linear relation between ǫ and σ_U in the Gaussian mechanism, where ∆s_U = 2C/m is the sensitivity of the aggregation operation and m is the data size of each client. We then set the samples in the i-th local noise vector to follow the same distribution n_i ~ ϕ(n) (i.i.d. for all i), because each client complies with the same global (ǫ, δ)-DP. The aggregation process with artificial noise added by the clients can be expressed as

    \tilde{w} = \sum_{i=1}^{N} p_i (w_i + n_i) = \sum_{i=1}^{N} p_i w_i + \sum_{i=1}^{N} p_i n_i.    (22)

The distribution \phi_N(n) of \sum_{i=1}^{N} p_i n_i can be expressed as

    \phi_N(n) = \bigotimes_{i=1}^{N} \varphi_i(n),    (23)

where p_i n_i \sim \varphi_i(n), and \bigotimes is the convolution operation. When we use the Gaussian mechanism for n_i with noise scale σ_U, the distribution of p_i n_i is also a Gaussian distribution. To obtain a small sensitivity ∆s_D, we set p_i = 1/N. Furthermore, the noise scale σ_U/√N of the Gaussian distribution \phi_N(n) can be calculated. To ensure a global (ǫ, δ)-DP in the downlink channels, we know the standard deviation of the additive noise can be set to σ_A = cT∆s_D/ǫ, where ∆s_D = 2C/(mN). Hence, we can obtain the standard deviation of the additive noise at the server as

    \sigma_D = \sqrt{\sigma_A^2 - \frac{\sigma_U^2}{N}} =
    \begin{cases}
        \dfrac{2cC\sqrt{T^2 - L^2 N}}{mN\epsilon}, & T > L\sqrt{N}, \\
        0, & T \le L\sqrt{N}.
    \end{cases}    (24)

Hence, Theorem 1 has been proved.

APPENDIX C
PROOF OF LEMMA 2

Due to Assumption 1, we have

    \mathbb{E}\{\|\nabla F_i(w) - \nabla F(w)\|^2\} \le \mathbb{E}\{\varepsilon_i^2\}    (25)

and

    \mathbb{E}\{\|\nabla F_i(w) - \nabla F(w)\|^2\}
    = \mathbb{E}\{\|\nabla F_i(w)\|^2\} - 2\mathbb{E}\{\nabla F_i(w)^{\top} \nabla F(w)\} + \|\nabla F(w)\|^2
    = \mathbb{E}\{\|\nabla F_i(w)\|^2\} - \|\nabla F(w)\|^2.    (26)

Considering (25), (26) and \nabla F(w) = \mathbb{E}\{\nabla F_i(w)\}, we have

    \mathbb{E}\{\|\nabla F_i(w)\|^2\} \le \|\nabla F(w)\|^2 + \mathbb{E}\{\varepsilon_i^2\} = \|\nabla F(w)\|^2 B(w)^2.    (27)

Note that when \|\nabla F(w)\|^2 \ne 0, there exists

    B(w) = \sqrt{1 + \frac{\mathbb{E}\{\varepsilon_i^2\}}{\|\nabla F(w)\|^2}} \ge 1,    (28)

which satisfies the equation. We can notice that a smaller value of B(w) implies that the local loss functions are more locally similar. When all the local loss functions are the same, then B(w) = 1, for all w. Therefore, we have

    \mathbb{E}\{\|\nabla F_i(w)\|^2\} \le \|\nabla F(w)\|^2 B^2, \quad \forall i,    (29)

where B is the upper bound of B(w). This completes the proof.

APPENDIX D
PROOF OF LEMMA 3

Considering the aggregation process with artificial noise added by the clients and the server in the (t + 1)-th aggregation, we have

    \tilde{w}^{(t+1)} = \sum_{i=1}^{N} p_i w_i^{(t+1)} + n^{(t+1)},    (30)

where Therefore,
N
X
n(t) =
(t) (t)
pi n i + n D . (31) e (t+1) − w
kw e (t) k ≤ kw(t+1) − w
e (t) k + kn(t+1) k
(t+1)
i=1 ≤ E{kwi −we (t) k} + kn(t+1) k
Because Fi (·) is ρ-Lipschitz smooth, we know 1+θ (42)
≤ e (t) )k} + kn(t+1) k
E{k∇Fi (w
µ
e (t+1) ) ≤ Fi (w
Fi ( w e (t) ) + ∇Fi (w
e (t) )⊤ (w
e (t+1) − w e (t) ) B(1 + θ)
≤ e (t) )k + kn(t+1) k.
k∇F (w
ρ (t+1) µ
+ kw e −w e (t) k2 , (32)
2 Using (37) and (38), we know
e (t+1) , w
for all w e (t) . Combining F (w
e (t) ) = E{Fi (w
e (t) )} and kE{∇Fi (wi
(t+1)
e (t) ) − E{∇J(wi
)} − ∇F (w
(t+1)
e (t) )}k
;w
(t) (t)
∇F (w e ) = E{∇Fi (w e )}, we have (t+1) (t+1)
≤ ρE{kwi −we (t) k} + E{k∇J(wi e (t) )k}
;w
e (t+1) )−F (w
E{F (w e (t) )} ≤ E{∇F (w
e (t) )⊤ (w
e (t+1) −w e (t) )} ρB(1 + θ)
≤ e (t) )k + Bθk∇F (w
k∇F (w e (t) )k.
ρ µ
+ E{kwe (t+1) − w e (t) k2 }. (33) (43)
2
Substituting (37), (42) and (43) into (33), we know
We define
e (t+1) ) − F (w
E{F (w e (t) )}
(t+1) (t+1) µ (t+1)  
J(wi e (t) )
;w , Fi (wi )+ kwi e (t) k2 .
−w (34) (t) ⊤ 1 1
2 ≤ E ∇F (w e ) − ∇F (w e (t) ) + n(t+1)
µ µ
Then, we know   
ρB(1 + θ) Bθ (t)
+ + k∇F (w e )k
(t+1) (t+1) µµ µ
∇J(wi e (t) ) = ∇Fi (wi
;w ) ( 2 )
  ρ B(1 + θ) (t) (t+1)
(t+1)
+ µ wi e (t)
−w (35) + E k∇F (w e )k + kn k . (44)
2 µ
and Then, using triangle inequation, we can obtain
N 
X  e (t+1) ) − F (w
E{F (w e (t) )} ≤ λ2 k∇F (w
e (t) )k2
(t+1) (t) (t+1) (t+1) (t+1) (t) (45)
e
w −w
e = wi + ni + nD −w
e + λ1 E{kn(t+1) k}k∇F (w
e (t) )k + λ0 E{kn(t+1) k2 }.
i=1
1 where
(t+1) (t+1)  
= E{∇J(wi e (t) ) − ∇Fi (wi
;w )} + n(t+1) . (36) 1 B ρ(1 + θ) ρB 2 (1 + θ)2
µ λ2 = − + +θ + , (46)
µ µ µ 2µ2
Because Fi (·) is ρ-Lipschitz smooth, we can obtain
1 ρB(1 + θ) ρ
λ1 = + and λ0 = . (47)
(t+1)
E{∇Fi (wi )} ≤ E{∇Fi (w e (t) ) + ρkwi
(t+1)
−w e (t) k} µ µ 2
(t+1) In this convex case, where µ = µ, if θ = 0, all subproblems
= e (t) ) + ρE{kwi
∇F (w −w e (t) k}. (37) ρB 2
are solved accurately. We know λ2 = − µ1 + ρB
µ2 + 2µ2 , λ1 =
1 ρB ρ
Now, let us bound kwi
(t+1)
e (t) k. We know
−w µ + µ and λ0 = 2 . This completes the proof. 

(t+1) (t+1) (t+1) (t+1)


kwi e (t) k ≤ kwi
−w −w
bi −w e (t) k,
k + kw
bi A PPENDIX E
(38) P ROOF OF T HEOREM 2
(t+1)
where w bi e (t) ). Let us define µ =
= arg minw Ji (w; w We assume that F satisfies the Polyak-Lojasiewicz inequal-
µ + l > 0, then we know Ji (w; we (t) ) is µ-convexity. Based ity [39] with positive parameter l, which implies that
on this, we can obtain 1
e (t) ) − F (w∗ )} ≤
E{F (w k∇F (w e (t) )k2 . (48)
(t+1) (t+1) θ 2l
kw
bi − wi e (t) )k
k ≤ k∇Fi (w (39)
µ Moreover, subtract E{F (w∗ )} in both sides of (45), we know
and e (t+1) ) − F (w∗ )}
E{F (w
(t+1) 1
(t)
kw
bi e (t) )k,
e k ≤ k∇Fi (w
−w (40) e (t) ) − F (w∗ )} + λ2 k∇F (w
≤ E{F (w e (t) )k2
µ
+ λ1 E{kn(t+1) k}k∇F (w
e (t) )k + λ0 E{kn(t+1) k2 }. (49)
where θ denotes a θ solution of minw Ji (w; w e (t) ), which is
defined in [19]. Now, we can use the inequality (39) and (40) Considering k∇F (w(t) )k ≤ β and (48), we have
to obtain
e (t+1) )−F (w∗ )} ≤ (1+2lλ2 )E{F (w
E{F (w e (t) )−F (w∗ )}
(t+1) 1+θ
kwi e (t) k ≤
−w e (t) )k.
k∇Fi (w (41) + λ1 βE{kn(t+1) k} + λ0 E{kn(t+1) k2 }, (50)
µ

where F (w∗ ) is the loss function corresponding to the optimal Considering the independence of adding noises, we know
parameters w∗ . Considering the same and independent distri- 2
!
2n∆sD +k∆sD k

bution of additive noises, we define E{kn(t) k} = E{knk} T ln 1 − q + qe 2σ 2
A ≥ −ǫ. (56)
and E{kn(t) k2 } = E{knk2 }, for 0 ≤ t ≤ T . Applying (50)
recursively, we have We can obtain the result
 
e
E{F (w (T ) ∗ T
)−F (w )} ≤ (1+2lλ2 ) E{F (w (0) ∗
)−F (w )} σ2 exp(− Tǫ ) 1 ∆sD
n ≤ − A ln − +1 − . (57)
T
X −1 ∆sD q q 2
 t
+ λ1 βE{knk} + λ0 E{knk2 } (1 + 2lλ2 ) We set  
t=0 T exp(−ǫ/T ) − 1
b=− ln +1 . (58)
= (1 + 2lλ2 )T E{F (w(0) ) − F (w∗ )} ǫ q
 (1 + 2lλ2 )T − 1 Hence,  
+ λ1 βE{knk} + λ0 E{knk2 } . (51) exp(−ǫ/T ) − 1 bǫ
2lλ2 ln +1 =− . (59)
√ q T
If T ≤ L N and then σD = 0, this case √ is special. Hence, Note that ǫ and T should satisfy
we will consider the condition that T > L N . Based on (13),
we have σA = ∆sD T c/ǫ. Hence, we can obtain −ǫ
ǫ < −T ln (1 − q) or T > . (60)
r ln (1 − q)
∆sD T c 2N ∆s2D T 2 c2 N Then,
E{knk} = and E{knk2 } = .
ǫ π ǫ2 σA2 bǫ ∆sD
(52) n≤ − . (61)
T ∆sD 2
2 2
Substituting (52) into (51), setting ∆sD = 1/mN and Using the tail bound Pr[n > η] ≤ √σ2π
A 1 −η /2σA
ηe , we can
F (w(0) ) − F (w∗ ) = Θ, we have obtain r !
 
η η2 21
e (T ) ) − F (w∗ )} ≤ (1 + 2lλ2 )T Θ
E{F (w ln + 2 > ln . (62)
r ! σA 2σA πδ
T
λ1 T βc 2 λ0 T 2 c2 (1 + 2lλ2 ) − 1
+ + 2 Let us set σA = c∆sD T /bǫ, if bǫ/T ∈ (0, 1), the inequa-
ǫ Nπ ǫ N 2lλ2 (53)
 2
 tion (62) can be solved as
κ1 T κ0 T   
= PTΘ + + 2 1 − PT , 2 1.25
ǫ ǫ c ≥ 2 ln . (63)
q δ
λ1 βc 2
where P = 1 + 2lλ2 , κ1 = m(P −1) N π and κ0 = Meanwhile, ǫ and T should satisfy
λ0 c2
  q −ǫ
m2 (P −1)N . This completes the proof.
ǫ < −T ln 1 − q + or T > . (64)
e ln 1 − q + qe
A PPENDIX F
If bǫ/T > 1, we can also obtain σA = c∆sD T /bǫ by adjusting
P ROOF OF L EMMA 4
the value of c. The standard deviation of requiring noises is
We define the sampling parameter q , K/N to represent given as
the probability of being selected by the server for each client c∆sD T
σA ≥ . (65)
in an aggregation. Let M1:T denote (M1 , . . . , MT ) and bǫ
similarly let o1:T denote a sequence of outcomes (o1 , . . . , oT ). Hence, if Gaussian noises are added at the client sides, we can
Considering a global (ǫ, δ)-DP in the downlinks channels, obtain the additive noise scale in the server as
we use σA to represent the standard deviation of aggregated s 2
Gaussian noises. With neighboring datasets Di and Di′ , we are c∆sD T c2 L2 ∆s2U
looking at σD = −
bǫ Kǫ2
′  q
Pr[M1:T (Di,1:T ) = o1:T ]  2cC Tb22 −L2 K √
ln T > bL K, (66)
Pr[M1:T (Di,1:T ) = o1:T ] = mKǫ √
0 T ≤ bL K.

knk2

kn+∆sD k2
X T 2
(1 − q)e 2σA + qe 2σ 2
A Furthermore, considering (60), we can obtain
= ln (54)

knk2
 q 2
i=1 e 2σ 2
A  2cC Tb2 −L2 K
! σD = mKǫ T > γǫ , (67)
Y T 2n∆sD +∆s2 D 0

2σ 2 T ≤ γǫ ,
= ln 1 − q + qe A .

i=1
where
This quantity is bounded by ǫ, we require  −ǫ


′ γ = − ln 1 − q + qe L K . (68)
Pr[M1:T (Di,1:T ) = o1:T ]
ln
Pr[M1:T (Di,1:T ) = o1:T ] ≤ ǫ. (55) This completes the proof. 

APPENDIX G
PROOF OF THEOREM 3

Here we define

v^(t) = Σ_{i=1}^{K} p_i w_i^(t),        (69)

ṽ^(t) = Σ_{i=1}^{K} p_i ( w_i^(t) + n_i^(t) ) + n_D^(t)        (70)

and

n^(t+1) = Σ_{i=1}^{K} p_i n_i^(t+1) + n_D,        (71)

which consider the aggregated parameters under K-client random scheduling. Because F_i(·) and F(·) are β-Lipschitz, we obtain

E{F(ṽ^(t+1))} − F(w^(t+1)) ≤ β ‖ṽ^(t+1) − w^(t+1)‖.        (72)

Because β is the Lipschitz continuity constant of the function F, we have

β ≤ ‖∇F(ṽ^(t))‖ + ρ ( ‖w^(t+1) − ṽ^(t)‖ + ‖v^(t+1) − ṽ^(t)‖ ).        (73)

From (42), we know

‖w^(t+1) − ṽ^(t)‖ ≤ (B(1+θ)/µ) ‖∇F(ṽ^(t))‖.        (74)

Then, we have

E{‖w^(t+1) − ṽ^(t+1)‖²} = ‖w^(t+1)‖² − 2[w^(t+1)]^⊤ E{ṽ^(t+1)} + E{‖ṽ^(t+1)‖²}.        (75)

Furthermore, we can obtain

E{ṽ^(t+1)} = (K/N) Σ_{i=1}^{N} (1/K) w_i^(t+1) + n^(t+1) = E{w_i^(t+1)} + n^(t+1) = w^(t+1) + n^(t+1)        (76)

and

E{‖ṽ^(t+1)‖²} = E{ ‖Σ_{i=1}^{K} p_i w_i^(t+1)‖² } + E{ ‖Σ_{i=1}^{K} p_i n_i^(t+1)‖² } + 2 E{ [Σ_{i=1}^{K} p_i w_i^(t+1)]^⊤ n^(t+1) }.        (77)

Note that we set p_i = D_i / Σ_{i=1}^{K} D_i = 1/K in K-client random scheduling in order to obtain a small sensitivity ∆s_D. We have

E{ ‖Σ_{i=1}^{K} p_i w_i^(t+1)‖² } ≤ (1/K²) Σ_{i=1}^{K} ‖w_i^(t+1)‖² + ((K−1)/K) ‖w^(t+1)‖²        (78)

and

E{‖ṽ^(t+1)‖²} ≤ (1/K²) Σ_{i=1}^{K} ‖w_i^(t+1)‖² + ((K−1)/K) ‖w^(t+1)‖² + ‖n^(t+1)‖² + 2 [w^(t+1)]^⊤ n^(t+1).        (79)

Combining (75) and (79), we can obtain

E{‖w^(t+1) − ṽ^(t+1)‖²} ≤ (1/K²) Σ_{i=1}^{K} ‖w_i^(t+1) − ṽ^(t)‖² + ‖n^(t+1)‖².        (80)

Using (41), we know

E{‖w^(t+1) − ṽ^(t+1)‖²} ≤ ‖n^(t+1)‖² + (B²(1+θ)²/(Kµ²)) ‖∇F(ṽ^(t))‖².        (81)

Moreover,

E{‖w^(t+1) − ṽ^(t+1)‖} ≤ ‖n^(t+1)‖ + (B(1+θ)/(µ√K)) ‖∇F(ṽ^(t))‖.        (82)

Substituting (45), (73) and (82) into (72), setting θ = 0 and µ = µ, we can obtain

E{F(ṽ^(t+1))} − F(ṽ^(t)) ≤ F(w^(t+1)) − F(ṽ^(t)) + ( ‖∇F(ṽ^(t))‖ + 2ρ ‖w^(t+1) − ṽ^(t)‖ ) E{‖w^(t+1) − ṽ^(t+1)‖} + ρ E{‖w^(t+1) − ṽ^(t+1)‖²}
  = α2 ‖∇F(ṽ^(t))‖² + α1 ‖n^(t+1)‖ ‖∇F(ṽ^(t))‖ + α0 ‖n^(t+1)‖²,        (83)

where

α2 = (1/µ²) ( ρB²/2 + ρB + ρB²/K + 2ρB²/√K + µB/√K − µ ),        (84)

α1 = 1 + 2ρB/µ + 2ρB√K/(µN)   and   α0 = 2ρK/N + ρ.        (85)

In this case, we take the expectation E{F(ṽ^(t+1)) − F(ṽ^(t))} and obtain

E{F(ṽ^(t+1)) − F(ṽ^(t))} ≤ α2 ‖∇F(ṽ^(t))‖² + α1 E{‖n^(t+1)‖} ‖∇F(ṽ^(t))‖ + α0 E{‖n^(t+1)‖²}.        (86)

For Θ > 0 and F(v^(0)) − F(w∗) = Θ, we can obtain

E{F(ṽ^(t+1)) − F(w∗)} ≤ E{F(ṽ^(t)) − F(w∗)} + α2 ‖∇F(ṽ^(t))‖² + α1 β E{‖n^(t+1)‖} + α0 E{‖n^(t+1)‖²}.        (87)

If we select the penalty parameter µ to make α2 < 0 and use (48), we know

E{F(ṽ^(t+1)) − F(w∗)} ≤ (1 + 2lα2) E{F(ṽ^(t)) − F(w∗)} + α1 β E{‖n^(t+1)‖} + α0 E{‖n^(t+1)‖²}.        (88)
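Before unrolling this recursion, the scheduling expectation used in (76) can be illustrated numerically. The following Python sketch (a toy check with placeholder values, not part of the proof) draws uniformly random K-client subsets with p_i = 1/K and confirms that the subset average matches the average over all N local models in expectation; the additive noise n^(t+1) is omitted since it enters (76) additively.

import random

def mc_subset_average(w, K, trials=100000):
    """Monte Carlo illustration of (76): averaging a uniformly drawn
    K-client subset of the N local models (with p_i = 1/K) equals, in
    expectation, the average over all N local models."""
    N = len(w)
    total = 0.0
    for _ in range(trials):
        subset = random.sample(range(N), K)
        total += sum(w[i] for i in subset) / K
    return total / trials, sum(w) / N   # empirical mean vs. full average

random.seed(1)
w = [random.gauss(0.0, 1.0) for _ in range(50)]   # toy scalar "models"
print(mc_subset_average(w, K=10))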

Considering the independence of the additive noises and applying (88) recursively, we have

E{F(ṽ^(T)) − F(w∗)} ≤ (1 + 2lα2)^T E{F(v^(0)) − F(w∗)} + ( (1 − (1 + 2lα2)^T) / (−2lα2) ) ( α1 β E{‖n‖} + α0 E{‖n‖²} )
  = Q^T Θ + ( (1 − Q^T)/(1 − Q) ) ( α1 β E{‖n‖} + α0 E{‖n‖²} ),        (89)

where Q = 1 + 2lα2. Substituting (65) into (89), we can obtain

E{‖n‖} = (∆s_D T c/(bǫ)) √(2N/π),   E{‖n‖²} = ∆s_D² T² c² N / (b²ǫ²)        (90)

and

E{F(ṽ^(T)) − F(w∗)} ≤ Q^T Θ + ( (1 − Q^T)/(1 − Q) ) [ cCα1β √(2N/π) / ( −mK ln(1 − K/N + (K/N) e^{−ǫ/T}) ) + c²C²α0 N / ( m²K² ln²(1 − K/N + (K/N) e^{−ǫ/T}) ) ].        (91)

This completes the proof.
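The step from the one-round bound (88) to the T-round bound (89) is a geometric-series unrolling. The following Python sketch (with placeholder values for Θ, Q and the per-round noise term; it is not a statement about specific parameter choices in the paper) verifies that iterating the recursion T times reproduces the closed form used in (89).

def unrolled_vs_closed_form(Theta, Q, noise, T):
    """Unroll the one-round contraction e_{t+1} <= Q * e_t + noise
    (cf. (88)) and compare it with the closed form
    Q**T * Theta + (1 - Q**T) / (1 - Q) * noise used in (89)."""
    e = Theta
    for _ in range(T):
        e = Q * e + noise
    closed = Q ** T * Theta + (1.0 - Q ** T) / (1.0 - Q) * noise
    return e, closed

# Q = 1 + 2*l*alpha2 must lie in (0, 1), i.e., alpha2 < 0, for convergence;
# Theta, Q and noise below are placeholders only.
print(unrolled_vs_closed_form(Theta=5.0, Q=0.9, noise=0.02, T=25))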

Kang Wei received the B.Sc. degree in information engineering from Xidian University, Xi'an, China, in 2014. From September 2017 to August 2018, he was an M.Sc. candidate, and since September 2018 he has been a Ph.D. candidate at the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. His current research interests include data privacy and security, differential privacy, AI and machine learning, information theory, and channel coding theory in NAND flash memory.

Chuan Ma received the B.S. degree from the Beijing University of Posts and Telecommunications, Beijing, China, in 2013 and the Ph.D. degree from the University of Sydney, Australia, in 2018. He is now working as a lecturer at the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. He has published more than 10 journal and conference papers, including a best paper in WCNC 2018. His research interests include stochastic geometry, wireless caching networks and machine learning, and he now focuses on big data analysis and privacy preservation.

Jun Li (M’09-SM’16) received the Ph.D. degree in Electronic Engineering from Shanghai Jiao Tong University, Shanghai, P. R. China, in 2009. From January 2009 to June 2009, he worked in the Department of Research and Innovation, Alcatel Lucent Shanghai Bell, as a Research Scientist. From June 2009 to April 2012, he was a Postdoctoral Fellow at the School of Electrical Engineering and Telecommunications, the University of New South Wales, Australia. From April 2012 to June 2015, he was a Research Fellow at the School of Electrical Engineering, the University of Sydney, Australia. Since June 2015, he has been a Professor at the School of Electronic and Optical Engineering, Nanjing University of Science and Technology, Nanjing, China. He was a visiting professor at Princeton University from 2018 to 2019. His research interests include network information theory, game theory, distributed intelligence, multiple agent reinforcement learning, and their applications in ultra-dense wireless networks, mobile edge computing, network privacy and security, and the industrial Internet of Things. He has co-authored more than 200 papers in IEEE journals and conferences, and holds one US patent and more than 10 Chinese patents in these areas. He has served as an editor of IEEE Communications Letters and as a TPC member for several flagship IEEE conferences. He was recognized as an Exemplary Reviewer of IEEE Transactions on Communications in 2018, and received the best paper award from the IEEE International Conference on 5G for Future Wireless Networks in 2017.

Howard H. Yang (S’13-M’17) received the B.Sc. degree in Communication Engineering from Harbin Institute of Technology (HIT), China, in 2012, and the M.Sc. degree in Electronic Engineering from Hong Kong University of Science and Technology (HKUST), Hong Kong, in 2013. He earned the Ph.D. degree in Electronic Engineering from Singapore University of Technology and Design (SUTD), Singapore, in 2017. From August 2015 to March 2016, he was a visiting student in the WNCG under the supervision of Prof. Jeffrey G. Andrews at the University of Texas at Austin. Dr. Yang is now a Postdoctoral Research Fellow with Singapore University of Technology and Design in the Wireless Networks and Decision Systems (WNDS) group led by Prof. Tony Q. S. Quek. He held a visiting research appointment at Princeton University from September 2018 to April 2019. His research interests cover various aspects of wireless communications, networking, and signal processing, currently focusing on the modeling of modern wireless networks, high dimensional statistics, graph signal processing, and machine learning. He received the IEEE WCSP 10-Year Anniversary Excellent Paper Award in 2019 and the IEEE WCSP Best Paper Award in 2014.

Ming Ding (M’12-SM’17) received the B.S. and M.S. degrees (with first class Hons.) in electronics engineering from Shanghai Jiao Tong University (SJTU), Shanghai, China, and the Doctor of Philosophy (Ph.D.) degree in signal and information processing from SJTU, in 2004, 2007, and 2011, respectively. From April 2007 to September 2014, he worked at Sharp Laboratories of China in Shanghai, China, as a Researcher/Senior Researcher/Principal Researcher. He also served as the Algorithm Design Director and Programming Director for a system-level simulator of future telecommunication networks in Sharp Laboratories of China for more than 7 years. Currently, he is a senior research scientist at Data61, CSIRO, in Sydney, NSW, Australia. His research interests include information technology, data privacy and security, machine learning and AI, etc. He has authored over 100 papers in IEEE journals and conferences, all in recognized venues, and around 20 3GPP standardization contributions, as well as a Springer book “Multi-point Cooperative Communication Systems: Theory and Applications”. Also, he holds 21 US patents and has co-invented another 100+ patents on 4G/5G technologies in CN, JP, KR, EU, etc. Currently, he is an editor of IEEE Transactions on Wireless Communications and IEEE Wireless Communications Letters. Besides, he is or has been Guest Editor/Co-Chair/Co-Tutor/TPC member of several IEEE top-tier journals/conferences, e.g., the IEEE Journal on Selected Areas in Communications, the IEEE Communications Magazine, and the IEEE Globecom Workshops, etc. He was the lead speaker of the industrial presentation on unmanned aerial vehicles in IEEE Globecom 2017, which was awarded as the Most Attended Industry Program in the conference. Also, he was awarded in 2017 as the Exemplary Reviewer for IEEE Transactions on Wireless Communications.

Farhad Farokhi is a Lecturer (equivalent to Assistant Professor) at the Department of Electrical and Electronic Engineering at the University of Melbourne. Prior to that, he was a Research Scientist at the Information Security and Privacy Group at CSIRO’s Data61, a Research Fellow at the University of Melbourne, and a Postdoctoral Fellow at KTH Royal Institute of Technology. In 2014, he received his PhD degree from KTH Royal Institute of Technology. During his PhD studies, he was a visiting researcher at the University of California at Berkeley and the University of Illinois at Urbana-Champaign. Farhad has been the recipient of the VESKI Victoria Fellowship from the Victorian State Government as well as the McKenzie Fellowship and the 2015 Early Career Researcher Award from the University of Melbourne. He was a finalist in the 2014 European Embedded Control Institute (EECI) PhD Award. He has been part of numerous projects on data privacy and cyber-security funded by the Defence Science and Technology Group (DSTG), the Department of the Prime Minister and Cabinet (PMC), the Department of Environment and Energy (DEE), and CSIRO in Australia.

Shi Jin (S’06-M’07-SM’17) received the B.S. degree in communications engineering from Guilin University of Electronic Technology, Guilin, China, in 1996, the M.S. degree from Nanjing University of Posts and Telecommunications, Nanjing, China, in 2003, and the Ph.D. degree in information and communications engineering from Southeast University, Nanjing, in 2007. From June 2007 to October 2009, he was a Research Fellow with the Adastral Park Research Campus, University College London, London, U.K. He is currently with the faculty of the National Mobile Communications Research Laboratory, Southeast University. His research interests include space-time wireless communications, random matrix theory, and information theory. He serves as an Associate Editor for the IEEE Transactions on Wireless Communications, IEEE Communications Letters, and IET Communications. Dr. Jin and his co-authors have been awarded the 2011 IEEE Communications Society Stephen O. Rice Prize Paper Award in the field of communication theory and a 2010 Young Author Best Paper Award by the IEEE Signal Processing Society.

H. Vincent Poor (S’72-M’77-SM’82-F’87) received the Ph.D. degree in EECS from Princeton University in 1977. From 1977 until 1990, he was on the faculty of the University of Illinois at Urbana-Champaign. Since 1990 he has been on the faculty at Princeton, where he is the Michael Henry Strater University Professor of Electrical Engineering. From 2006 until 2016, he served as Dean of Princeton’s School of Engineering and Applied Science. He has also held visiting appointments at several other institutions, including most recently at Berkeley and Cambridge. His research interests are in the areas of information theory, signal processing and machine learning, and their applications in wireless networks, energy systems and related fields. Among his publications in these areas is the forthcoming book Advanced Data Analytics for Power Systems (Cambridge University Press, 2020).
Dr. Poor is a member of the National Academy of Engineering and the National Academy of Sciences, and is a foreign member of the Chinese Academy of Sciences, the Royal Society and other national and international academies. He received the Technical Achievement and Society Awards of the IEEE Signal Processing Society in 2007 and 2011, respectively. Recent recognition of his work includes the 2017 IEEE Alexander Graham Bell Medal, the 2019 ASEE Benjamin Garver Lamme Award, a D.Sc. honoris causa from Syracuse University, awarded in 2017, and a D.Eng. honoris causa from the University of Waterloo, awarded in 2019.

Tony Q.S. Quek (S’98-M’08-SM’12-F’18) received


the B.E. and M.E. degrees in electrical and electron-
ics engineering from the Tokyo Institute of Technol-
ogy, Tokyo, Japan, in 1998 and 2000, respectively,
and the Ph.D. degree in electrical engineering and
computer science from the Massachusetts Institute
of Technology, Cambridge, MA, USA, in 2008.
Currently, he is the Cheng Tsang Man Chair Pro-
fessor with Singapore University of Technology and
Design (SUTD). He also serves as the Head of ISTD
Pillar, Sector Lead of the SUTD AI Program, and the
Deputy Director of the SUTD-ZJU IDEA. His current research topics include
wireless communications and networking, network intelligence, internet-of-
things, URLLC, and big data processing.
Dr. Quek has been actively involved in organizing and chairing sessions, and has served as a member of the Technical Program Committee as well as symposium chairs in a number of international conferences. He is currently serving as an Editor for the IEEE Transactions on Wireless Communications, the Chair of the IEEE VTS Technical Committee on Deep Learning for Wireless Communications as well as an elected member of the IEEE Signal Processing Society SPCOM Technical Committee. He was an Executive Editorial Committee Member for the IEEE Transactions on Wireless Communications, an Editor for the IEEE Transactions on Communications, and an Editor for the IEEE Wireless Communications Letters.
Dr. Quek was honored with the 2008 Philip Yeo Prize for Outstanding
Achievement in Research, the 2012 IEEE William R. Bennett Prize, the 2015
SUTD Outstanding Education Awards – Excellence in Research, the 2016
IEEE Signal Processing Society Young Author Best Paper Award, the 2017
CTTC Early Achievement Award, the 2017 IEEE ComSoc AP Outstanding
Paper Award, and the 2016-2019 Clarivate Analytics Highly Cited Researcher.
He is a Distinguished Lecturer of the IEEE Communications Society and a
Fellow of IEEE.
