
Confined Gradient Descent: Privacy-preserving Optimization

for Federated Learning


Yanjun Zhang (The University of Queensland, St Lucia, Queensland, Australia) yanjun.zhang@uq.edu.au
Guangdong Bai∗ (The University of Queensland, St Lucia, Queensland, Australia) g.bai@uq.edu.au
Xue Li (The University of Queensland, St Lucia, Queensland, Australia) xueli@itee.uq.edu.au
Surya Nepal (Data61 CSIRO, Marsfield, New South Wales, Australia) Surya.Nepal@data61.csiro.au
Ryan K L Ko (The University of Queensland, St Lucia, Queensland, Australia) ryan.ko@uq.edu.au

∗ The corresponding author.

arXiv:2104.13050v1 [cs.LG] 27 Apr 2021
ABSTRACT

Federated learning enables multiple participants to collaboratively train a model without aggregating the training data. Although the training data are kept within each participant and the local gradients can be securely synthesized, recent studies have shown that such privacy protection is insufficient. The global model parameters that have to be shared for optimization are susceptible to leaking information about training data. In this work, we propose Confined Gradient Descent (CGD) that enhances privacy of federated learning by eliminating the sharing of global model parameters. CGD exploits the fact that a gradient descent optimization can start with a set of discrete points and converge to another set at the neighborhood of the global minimum of the objective function. It lets the participants independently train on their local data, and securely share the sum of local gradients to benefit each other. We formally demonstrate CGD's privacy enhancement over traditional FL. We prove that less information is exposed in CGD compared to that of traditional FL. CGD also guarantees desired model accuracy. We theoretically establish a convergence rate for CGD. We prove that the loss of the proprietary models learned for each participant against a model learned by aggregated training data is bounded. Extensive experimental results on two real-world datasets demonstrate that the performance of CGD is comparable with the centralized learning, with marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%).

1 INTRODUCTION

The performance of machine learning largely relies on the availability of large representative datasets. To take advantage of massive data owned by multiple entities, federated learning (FL) is proposed [24, 25, 46]. It enables participants to jointly train a global model without the necessity of sharing their datasets, demonstrating the potential to address the issues of data privacy and data ownership. It has been incorporated by popular machine learning tools such as TensorFlow [1] and PyTorch [37], and increasingly spread over various industries.

The privacy preservation of FL stems from its parallelization of the gradient descent optimization, which in essence is an application of stochastic gradient descent (SGD) (or the mini-batch mode) [25]. During the training process, the participants work on the same intermediate global model via a coordinating server (in the centralized FL) [2, 4, 34] or a peer-to-peer communication scheme (in the decentralized FL) [22, 39]. Each of them obtains the current model parameters, works out a local gradient based on the local data, and disseminates it to update the global model synchronously [2, 4] or asynchronously [21]. This paradigm guarantees data locality, but has been found insufficient for data privacy: although the local gradients can be securely synthesized via a variety of techniques such as differential privacy (DP) [2, 21, 42, 48], secure multi-party computation (MPC) [4, 11, 34], and homomorphic encryption (HE) [31, 34, 40], the global model parameters that have to be shared are still susceptible to information leakage (cf. Section 5 and [35, 36]).

This work further decreases the dependency among participants by eliminating the explicit sharing of the central global model, which is the root cause of the information leakage [9, 35]. We propose a new optimization algorithm named Confined Gradient Descent (CGD) that enables each participant to learn a proprietary global model. The CGD participants maintain their global models locally, which are strictly confined within themselves from the beginning of and throughout the whole training process. We refer to these localized global models as confined models, to distinguish them from the global model in traditional FL.

CGD is inspired by an observation on the surface of the typical cost function. The steepness of the first derivative decreases slower when approaching the minimum of the function, due to the small values in the Hessian (i.e., the second derivative) near the optimum [5]. This gives the function, when plotted, a flat valley bottom. As such, a gradient descent algorithm A, when applied on an objective function F, could start with a set of discrete points (referred to as a colony; their distance is discussed later). Iteratively descending the colony using the joint gradient of the colony would lead A to the neighborhood of F's minimum in the "flat valley bottom". The points in the colony would also end up with similar losses that are close to the loss of the minimum.

In Figure 1, we illustrate a holistic comparison between the workflow of CGD and that of a gradient descent in traditional FL. In traditional FL, every participant updates the same global model w using their local gradients g_a, g_b, g_c. In CGD, each participant l first independently initializes the starting point of its confined model
w_1^l. Then, in every training iteration, participants independently compute the local gradient from their current confined model and local data, and then jointly work out the sum of all local gradients and use it to update their confined models (the equation in Figure 1b). By doing this, CGD aims to enhance privacy without sacrificing much model accuracy. For the sake of simplicity, we refer to these two properties as privacy and accuracy.

• Privacy. CGD should ensure that, throughout the training process, neither local data of a participant nor intermediate results computed on them can be observed by other participants or an aggregator (if any).
• Accuracy. The prediction made by any confined model should approach that of the centralized model that were to be learned centrally on the gathered data.

The desired privacy enhancement of CGD stems from two aspects, i.e., secrecy of confined models and secrecy of local gradients. For the former, besides always hiding the confined models from each other, each participant independently initializes its w_1^l at random. During the training process, any two confined models keep the same distance and never become closer to each other after descending, preventing any participant from predicting models of others. To further boost the unpredictability, each participant could select its own interval range of initial weights to avoid leaking the average distance between the confined models. CGD withstands interval ranges differing by two orders of magnitude. For the latter, CGD incorporates the secure addition operation on the local gradients to calculate their sum. This has been proved to be viable through the additive secret sharing scheme, in which the sum of a set of secret values is collaboratively calculated without revealing any addends [3, 4, 29, 45]. A previous study [47] demonstrates it is efficient when applied to achieve decentralization in FL. We prove that the adversary's observation in CGD is only the sum of local gradients, and it conceals the extra indicative information that traditional FL would leak (cf. Section 5).

We formally prove that CGD ensures convergence. It converges to confined models that are adjacent to the centralized model, and the adjacency is bounded (cf. Section 4). This merit guarantees the accuracy of CGD. We further evaluate the accuracy performance of CGD with two popular benchmark datasets, i.e., MNIST [28] and CIFAR-10 [26]. Our experiments demonstrate that its accuracy closely approaches that of the centralized learning. When the confined models are initialized with the standard initialization scheme (i.e., the Gaussian distribution of mean 0 and variance 1), it achieves marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%) and outperforms state-of-the-art federated learning with differential privacy. Its accuracy performance remains stable even when the interval ranges of initial weights among participants differ by two orders of magnitude.

Contributions. We summarize the main contributions as follows.

• Confined Gradient Descent For Privacy-enhancing Decentralized Federated Learning. We propose a new optimization algorithm CGD for privacy-preserving decentralized FL. CGD eliminates the explicit sharing of the global model and lets each participant learn a proprietary confined model. CGD retains the merits of traditional FL such as algorithm independence. Therefore, it can easily accommodate any FL schemes regardless of their underlying machine/deep learning algorithms. It also eliminates the necessity of a central coordinating server, such that the optimization can be conducted in a fully decentralized manner.
• Convergence Analysis. We theoretically establish a convergence rate for CGD under realistic assumptions on the loss function (such as convexity). We prove CGD converges toward the centralized model as the number of iterations increases. The distance between the trained confined models and the centralized model is bounded, and can be tuned by the hyper-parameter setting.
• Enhanced Privacy Preservation Over Traditional FL. With secrecy of both confined models and local gradients, CGD achieves enhanced privacy preservation over traditional FL. We prove that in CGD, given only the sum of the local gradients, an honest-but-curious white-box adversary, who may control t out of m participants (where t ≤ m − 2) including the aggregator for secure addition operation (if any), can learn no information other than their own inputs and the sum of the local gradients from other honest parties, whereas in traditional FL, extra indicative information about local data can be obtained.
• Functional Evaluations. We implement CGD and conduct experiments on two popular benchmark datasets, MNIST and CIFAR-10. The results demonstrate that CGD can closely approach the performance of the centralized model on the validation loss and accuracy, with marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%).

2 BACKGROUND AND RELATED WORKS

CGD is an optimization method based on the gradient descent. Therefore, in this section, we review the existing techniques for gradient updates in the traditional FL.

2.1 Stochastic gradient descent

Stochastic gradient descent (SGD) [6, 38] is an efficient variant of the gradient descent algorithm. It is extensively used for optimizing the objective function in machine learning and deep learning. Given a cost function F with the parameter w, SGD is defined by

    w_{k+1} ← w_k − α_k (1/|ξ_k|) ∇F(w_k, ξ_k),    (1)

where w_k is the parameter at the k-th iteration, ξ_k ∈ ξ is a randomly selected subset of the training samples at the k-th iteration, and α_k is the learning rate. Equation 1 can generalize to the mini-batch update when 1 < |ξ_k| < |ξ|, and to the batch update when ξ_k = ξ.

In FL, each local participant l ∈ L holds a subset of the training samples, denoted by ξ_l. To run SGD (or the mini-batch update), for each iteration, a random subset ξ_{l,k} ⊆ ξ_l from a random participant l is selected. The participant l then computes the gradient with respect to ξ_{l,k}, which can be written as ∇F(w_k, ξ_{l,k}), and shares the gradient with other participants (or a parameter server). All the participants (or the server) can thus take a gradient descent step by

    w_{k+1} ← w_k − α_k (1/|ξ_{l,k}|) ∇F(w_k, ξ_{l,k}).    (2)
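To make the update in Equations 1 and 2 concrete, the following is a minimal NumPy sketch of one federated SGD round for a linear model with a mean-squared-error loss. The function names, the quadratic loss, and the toy data are illustrative assumptions of ours rather than part of the original formulation; in a real deployment the gradient would come from the actual model being trained.

```python
import numpy as np

def grad_F(w, X, y):
    """Averaged gradient of the MSE loss F(w; xi) = (1/2|xi|) * ||Xw - y||^2."""
    return X.T @ (X @ w - y) / len(y)

def federated_sgd_step(w, local_data, alpha, batch_size, rng):
    """One round of Equation 2: a random participant l computes the gradient on
    a random mini-batch xi_{l,k} of its local data, and the shared global model
    w_k is updated with that gradient."""
    X_l, y_l = local_data[rng.integers(len(local_data))]           # random participant l
    idx = rng.choice(len(y_l), size=min(batch_size, len(y_l)), replace=False)
    g = grad_F(w, X_l[idx], y_l[idx])                               # (1/|xi_{l,k}|) grad F(w_k, xi_{l,k})
    return w - alpha * g                                            # w_{k+1} <- w_k - alpha_k * g

# Toy usage: three participants holding splits of a synthetic regression task.
rng = np.random.default_rng(0)
w_true = rng.normal(size=5)
local_data = [(X, X @ w_true + 0.01 * rng.normal(size=200))
              for X in (rng.normal(size=(200, 5)) for _ in range(3))]
w = np.zeros(5)
for _ in range(500):
    w = federated_sgd_step(w, local_data, alpha=0.1, batch_size=32, rng=rng)
```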
[Figure 1 panels (loss-surface illustrations): panel a) shows all participants descending one shared global model w from its initialization to w_*, using f over the local gradients g(w, ξ_a), g(w, ξ_b), g(w, ξ_c); panel b) shows each participant descending its own confined model from w_0^a, w_0^b, w_0^c to w_*^a, w_*^b, w_*^c using the sum Σ of the local gradients g(w^a, ξ_a), g(w^b, ξ_b), g(w^c, ξ_c). The legend distinguishes shared information in FL from protected information in FL.]

a) Gradient descent in traditional federated learning. Participants jointly work on the same global model w using the descent computed by f. Although the local gradients g_a, g_b, g_c can be securely synthesized, by knowing w and f, the adversary is able to derive information about the local raw data ξ_a, ξ_b, ξ_c.

b) Confined Gradient Descent. Each participant strictly confines their own global models from their initialization (w_0^a, w_0^b, w_0^c) to optimal values (w_*^a, w_*^b, w_*^c). The confined models descend at the same pace, and when CGD converges, reach the bottom of the valley where the centralized model is located. Any two confined models keep the same distance throughout the training process. In other words, w^a, w^b, w^c would not become closer to each other during descending, preventing any participant from predicting models of others.

Figure 1: Comparison of CGD and the gradient descent in traditional FL. This figure does not differentiate each iteration: the occurrences of traditional FL's global model w_k and confined models w_k^a, w_k^b, w_k^c in all iterations are represented by w, w^a, w^b, w^c, and iter represents the sum-up of all iterations.
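The fixed-distance property stated in the caption is easy to verify numerically. The one-dimensional quadratic losses below are a toy example of ours, not taken from the paper; they only illustrate that two models receiving the identical summed update keep their initial gap.

```python
import numpy as np

# Two participants hold different quadratic losses F_a(w) = (w - 2)^2 and
# F_b(w) = (w + 1)^2; their confined models start at different random points.
grad_a = lambda w: 2.0 * (w - 2.0)
grad_b = lambda w: 2.0 * (w + 1.0)

w_a, w_b = 5.0, -3.0          # independently initialized confined models
initial_gap = w_a - w_b
for _ in range(100):
    g_sum = grad_a(w_a) + grad_b(w_b)   # joint (summed) gradient of the colony
    w_a -= 0.05 * g_sum                 # both models take the identical step,
    w_b -= 0.05 * g_sum                 # so their distance never changes
assert abs((w_a - w_b) - initial_gap) < 1e-9
```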

The gradients, if shared in plain text, are subject to information leakage of the local training data. For example, model-inversion attacks [9, 32, 43] are able to restore training data from the gradients. In the immediately following sections, we summarize the existing privacy-preserving methods for synthesizing the local gradients, which fall into two broad categories, i.e., secure aggregation and learning with differential privacy.

2.2 Secure aggregation

Secure aggregation typically employs cryptographic mechanisms such as homomorphic encryption (HE) [7, 31, 40] and/or secure multiparty computation (MPC) [4, 10, 11, 30, 34, 47] to securely evaluate the gradient ∇F(w_k, ξ_{l,k}) without revealing local data. With ∇F(w_k, ξ_{l,k}), all the participants can thus take a gradient descent step by Equation 2.

Some existing studies fall into this category. For instance, Bonawitz et al. [4] present a secure aggregation protocol that allows a server to compute the sum of user-held data vectors, which can be used to aggregate user-provided model updates for a deep neural network. Mohassel et al. [34] propose a secure two-party computation (2PC) protocol that supports secure arithmetic operations on shared decimal numbers for calculating the gradient updates using SGD. Existing FL frameworks employing HE/MPC are mainly based on federated SGD. All the participants in it share and update one and the same global model, and this is subject to membership inference, as revealed by Nasr et al. [35].

2.3 Learning with differential privacy

Another line of studies approaches privacy-preserving FL through the differential privacy (DP) mechanism [2, 8, 12, 16, 21, 42, 44, 48, 49]. The common practice of achieving differential privacy is based on additive noise calibrated to ∇F's sensitivity S_∇F. As such, a differentially private learning framework can be achieved by updating parameters with perturbed gradients at each iteration, for example, to update parameters as

    w_{k+1} ← w_k − α_k ((1/|ξ_{l,k}|) ∇F(w_k, ξ_{l,k}) + N(0, S_∇F² · σ²)),    (3)

where N(0, S_∇F² · σ²) is the Gaussian distribution (a commonly used noise distribution in differentially private learning frameworks [8]) with mean 0 and standard deviation S_∇F · σ.

The privacy loss is accumulated with repeated access to the data during training epochs [2]. There is also an inherent tradeoff between privacy and utility of the trained model.

In summary, in all of the above approaches, the global model has to be shared with each participant, leading to the leakage of information. This motivates CGD's design to eliminate the explicit sharing of the central global model.
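The perturbed update of Equation 3 above can be sketched in a few lines. In this sketch, grad_fn, sensitivity, and sigma are caller-supplied placeholders of ours, and the per-example gradient clipping that usually enforces the sensitivity bound in practice is omitted for brevity.

```python
import numpy as np

def dp_sgd_step(w, grad_fn, X_batch, y_batch, alpha, sensitivity, sigma, rng):
    """One update of Equation 3: the averaged mini-batch gradient is perturbed
    with Gaussian noise N(0, S^2 * sigma^2) calibrated to the sensitivity S of
    the gradient before being applied to the shared model."""
    g = grad_fn(w, X_batch, y_batch)                    # (1/|xi_{l,k}|) grad F(w_k, xi_{l,k})
    noise = rng.normal(0.0, sensitivity * sigma, size=g.shape)
    return w - alpha * (g + noise)                      # w_{k+1} <- w_k - alpha_k (g + noise)
```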
3 CONFINED GRADIENT DESCENT

CGD optimizes an objective function in FL with multiple local datasets. It starts with a colony of discrete points, and then uses the combination of their gradients to lead the optimization to another colony of points at the neighborhood of the global optimum. In this section, we formalize this problem and present the workflow of CGD optimization.

3.1 Problem formulation

3.1.1 Optimization objective. Consider a centralized dataset ξ = {(x_i, y_i)}_{i=1}^n consisting of n training samples. The goal of machine learning is to find a model parameter w such that the overall loss, which is measured by the distance between the model prediction h(w, x_i) and the label y_i for each (x_i, y_i) ∈ ξ, is minimized. This is reduced to solving the following problem

    arg min_w (1/n) Σ_{i=1}^n F(w; ξ) + λ z(w),    (4)

where F(w; ξ) is the loss function, and z(w) is the regularizer for w. We use w_* to denote the optimal solution of centralized training (i.e., the centralized model).

In the context of FL, we have a system of m local participants, each of which holds a private dataset ξ_l ⊆ ξ (l ∈ [1, m]) consisting of a part of the training dataset. The part could be a part of training samples, a part of features that have common entries, or both. Assume the training takes T iterations, and let w_k^l denote the confined model of participant l at the k-th iteration, where k ∈ [1, T]. Let

    g_l(w_k^l, ξ_l) = (1/|ξ_l|) ∇F(w_k^l, ξ_l)    (5)

represent the local gradient with respect to ξ_l. We use w_*^l to denote the final confined model of participant l when CGD converges. The objective of CGD is to make w_*^l located at a neighborhood of the centralized model w_* within a bounded gap.

3.1.2 Attacker setting. We assume an honest-but-curious white-box adversary¹ who may control t out of m participants (where t ≤ m − 2), including the aggregator for secure addition operation (if any).

¹A white-box adversary knows the internals of the training algorithms such as the neural network architecture, and can observe the intermediate computations during the training iterations.

[Figure 2 panels: a) Centralized Training: training data collected centrally, one centralized model. b) Traditional Federated Learning: each participant holds a local training dataset; local parameters/gradients are pushed to a parameter server via privacy-preserving techniques such as MPC/HE/DP, and parameters are pulled from the server to replace the corresponding local parameters. c) CGD: each participant holds a local training dataset and a confined global model; local gradients (①) are combined through secure addition (②) and used to update each confined model (③, ④).]

Figure 2: Architectural comparison of the centralized training, traditional federated learning, and CGD.

3.2 CGD optimization

Figure 2 illustrates the architecture of CGD (Figure 2c), with a comparison to the centralized training (Figure 2a) and traditional FL (Figure 2b). In the centralized training, the datasets of all participants are gathered for training a single model. In the traditional FL, every participant owns its local training dataset, and updates the same global model w_k via a parameter server using its local model/gradients. The local gradients ∇F(w_k, ξ_{l,k}) can be protected via either secure aggregation [4, 30, 31, 34, 40, 47] or differential privacy mechanisms [2, 8, 21, 42, 44, 48]. This process can be decentralized by replacing the parameter server with a peer-to-peer communication mechanism [22, 39]. In CGD, each participant learns its own confined model (represented by different colors), i.e., each w_k^l is different and private. The model updating in CGD synthesizes the information from all training samples by summing up the local gradients, while in federated SGD (Figure 2b), each iteration takes into account only a subset of training samples.

To better position CGD, we summarize the related studies in the literature in Table 1. We use federated SGD to represent the SGD or the mini-batch update in the FL, including both the plain SGD in which the local gradient/model is shared in plaintext, and privacy-preserving SGD via secure aggregation or differential privacy mechanisms. CGD guarantees that each confined model, when CGD converges, is at the neighborhood of the centralized model, retaining the model accuracy (column 5 in Table 1 and proved in Section 4). It also achieves desired privacy preservation compared to the traditional FL (column 6 in Table 1 and detailed in Section 5).
Table 1: A summary of differences between the federated SGD and CGD.

Federated SGD (all variants)
  Architecture: all the participants jointly learn one and the same global model.
  Model update: update the model using the gradient computed on a subset of training samples in each iteration.

  Plain SGD [6, 38]
    Model accuracy: theoretically guaranteed convergence [6, 38].
    Privacy: the sharing of local parameters/gradients is subject to model inversion attack [9, 32, 43].
  Privacy-preserving SGD via secure aggregation [4, 30, 31, 34, 40, 47]
    Model accuracy: as above, since the mechanism guarantees that the computation result from ciphertext is the same as that from plaintext [4, 34].
    Privacy: the sharing of one and the same global model (even though the local gradients are protected) is still subject to information leakage (refer to [35] and Section 5).
  Privacy-preserving SGD via differential privacy [2, 8, 21, 42, 44, 48]
    Model accuracy: additive noise (in most cases) affects the model accuracy [8].
    Privacy: the privacy cost is accumulated with repeated accesses to the dataset [2].

Confined Gradient Descent
  Architecture: each participant learns and confines a different global model.
  Model update: the model update in each iteration embraces the information from all the training samples by summing up the local gradients.
  Model accuracy: theoretically guaranteed convergence with bounded gap to the centralized model.
  Privacy: boosted privacy preservation by eliminating the sharing of the global model.

Algorithm 1 outlines CGD optimization for training with the confined model w_k^l. In general, the optimization process consists of the following steps.

• Initialization. Each participant l randomizes its own w_1^l. A default setting is to sample based on the Gaussian distribution of mean 0 and variance 1, which is the standard weight initialization scheme used in most machine learning approaches [13, 27, 33]. To prevent colluding participants from inferring others' points by the knowledge of the average distance among the confined models, we introduce a hyper-parameter δ^l to control the interval range of initial weights w_1^l, i.e., w_1^l ∼ N[−δ^l, δ^l], and allow each participant l to independently choose its own δ^l.
• Step 1. At each iteration k, every participant computes the local gradient g_l(w_k^l, ξ_l) with respect to its current confined model w_k^l and own dataset ξ_l using Equation 5.
• Step 2. Securely compute Σ_{l=1}^m g_l(w_k^l, ξ_l), which is later used for calculating w_{k+1}^l (double lines in Figure 2c). This step presumes that computational tools exist for securely evaluating the sum of a set of secret values to avoid releasing the local gradient g_l(w_k^l, ξ_l) in plain text. We refer to Section 5 for more detail.
• Step 3. A scalar stepsize α_k > 0 is chosen given an iteration number k ∈ [1, T].
• Step 4. Every participant takes a descent step on its own w_k^l with α_k Σ_{l=1}^m g_l(w_k^l, ξ_l), i.e., w_{k+1}^l ← w_k^l − α_k Σ_{l=1}^m g_l(w_k^l, ξ_l).

Algorithm 1 Confined Gradient Descent Optimization
1: Input: Local training data ξ_l (l ∈ [1, m]),
2:    number of training iterations T
3: Output: Confined global model parameters w_*^l (l ∈ [1, m])
4: Initialize: k ← 1, each participant l randomizes its own w_1^l
5: while k ≤ T do
6:    for all participants l ∈ [1, m] do in parallel
7:       Compute the local gradient: g_l(w_k^l, ξ_l)
8:       Securely evaluate the sum: Σ_{l=1}^m g_l(w_k^l, ξ_l)
9:       Choose a stepsize: α_k
10:      Set the new iterate as: w_{k+1}^l ← w_k^l − α_k Σ_{l=1}^m g_l(w_k^l, ξ_l)
11:   end for
12:   k ← k + 1
13: end while
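The per-iteration update of Algorithm 1 can be sketched as follows. The secure addition of Step 2 is mocked here by a plain NumPy sum (in the real protocol it would be computed with the additive secret sharing discussed in Section 5, so that no individual gradient is revealed), and grad_fn, alpha_fn, and the single shared delta are illustrative placeholders of ours.

```python
import numpy as np

def cgd_train(local_datasets, grad_fn, T, alpha_fn, delta, rng):
    """Sketch of Algorithm 1. Each participant keeps its own confined model w^l;
    in every iteration all participants take the same descent step, namely the
    sum of all local gradients scaled by the stepsize."""
    d = local_datasets[0][0].shape[1]
    # Initialization: each confined model w_1^l is drawn independently as
    # delta * N(0, 1); the paper lets every participant pick its own delta^l.
    models = [delta * rng.normal(size=d) for _ in local_datasets]
    for k in range(1, T + 1):
        # Step 1: local gradient on each confined model and its own data.
        grads = [grad_fn(w_l, X_l, y_l)
                 for w_l, (X_l, y_l) in zip(models, local_datasets)]
        # Step 2: securely evaluate the sum of the local gradients (mocked).
        g_sum = np.sum(grads, axis=0)
        # Steps 3-4: choose a stepsize and descend every confined model with the sum.
        alpha_k = alpha_fn(k)
        models = [w_l - alpha_k * g_sum for w_l in models]
    return models
```

Because every confined model receives exactly the same update in each iteration, the pairwise distances fixed at initialization are preserved throughout training, which is the property depicted in Figure 1b.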
4 CONVERGENCE ANALYSIS

In this section, we conduct a formal convergence analysis on CGD. Convergence analysis has been extensively used in the literature [19–21] to prove the correctness of optimization algorithms. Through the analysis, we demonstrate the bound of the distance between an arbitrary w_*^l learned by CGD and the centralized model w_*. The analysis is centered around a regret function R, which is the difference between CGD's training loss and the loss of the centralized model, defined as

    R = (1/T) Σ_{k=1}^T (F(w_k^l) − F(w_*))    (6)

4.1 Assumptions

We make the following assumptions on the loss function F. They are common assumptions in convergence analyses of most gradient-based methods, and are satisfied by a variety of widely used cost functions [6], such as mean squared error (MSE) and cross entropy.

Assumption 1. (Lipschitz continuity). The loss function F: R^d → R is continuously differentiable and the gradient function of F, namely ∇F: R^d → R^d, is Lipschitz continuous with Lipschitz constant L > 0,

    ‖∇F(w) − ∇F(ŵ)‖_2 ≤ L‖w − ŵ‖_2 for all {w, ŵ} ⊂ R^d.    (7)

Intuitively, this assumption ensures that the gradient of F does not change arbitrarily in the course of descending, such that the gradient can be a proper indicator towards the optimum [6].
Assumption 2. (Strong convexity). The loss function 𝐹 : R𝑑 → R 4.3 Proof of Theorem 1
is strongly convex such that Our proof aims to identify an upper bound of R. To this end, we con-
sider the trend of the distance from 𝑤𝑘𝑙 to 𝑤 ∗ , which would shrink
b ≥ 𝐹 (𝑤) + ∇𝐹 (𝑤)𝑇 (𝑤
𝐹 (𝑤) b − 𝑤) for all {𝑤, 𝑤
b } ⊂ R𝑑 . (8)
as 𝑘 increases, if there is a bound existing. Since 𝑤𝑘𝑙 is updated
𝑚
A useful fact from Assumption 2 for our analysis is using
Í 𝑗
𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) (i.e., the descent in CGD), and 𝑤 ∗ is obtained
𝑗=1
(𝐹 (𝑤) − 𝐹 (𝑤 ∗ )) ≤ ⟨𝑤 − 𝑤 ∗, ∇𝐹 (𝑤)⟩ for all 𝑤 ⊂ R𝑑 , (9) by ∇𝐹 (i.e., the descent in the centralized training), the trend of the
distance therefore should be related to the deviation between these
where ⟨·,·⟩ denotes the inner product operation. two. Exploring this leads to the following lemma which describes
In addition, we adopt the same assumption on a bounded solution this relationship.
space used in related studies [6, 21].
Lemma 1. Let 𝑆𝑘+1 = 12 ∥𝑤𝑘+1 𝑙 − 𝑤 ∗ ∥ 22 and 𝑆𝑘 = 21 ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ 22 .
Assumption 3. (Bounded solution space). The set of {𝑤𝑘𝑙 }|𝑘 ∈ [1,𝑇 ] Let ∇𝐹 (·) = 𝑛 𝑖=1 ∇𝐹 (·, 𝜉𝑖 ). We have
1 Í𝑛
is contained in an open set over which 𝐹 is bounded below a scalar
𝐹𝑖𝑛𝑓 , such that
𝑚
(a) there exists a 𝐷 > 0, s.t., ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ 22 ≤ 𝐷 2 for all 𝑘, and 1 1
∑︁
𝑗
⟨𝑤𝑘𝑙 − 𝑤 ∗, ∇𝐹 (𝑤𝑘𝑙 )⟩ = 𝛼𝑘 ∥ 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 − (𝑆𝑘+1 − 𝑆𝑘 )
(b) there exists a 𝐺 > 0, s.t., ∥𝑔𝑙 (𝑤𝑘𝑙 , 𝜉𝑙 )∥ 22 ≤ 𝐺 2 for all 𝑤𝑘𝑙 ⊂ R𝑑 . 2 𝑗=1
𝛼 𝑘
𝑚
∑︁
𝑗
4.2 Main Theorem − ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩.
𝑗=1
We first present our main theorem below, and leave its proof to
Section 4.3. It demonstrates our main result on the converge rate Proof.
of CGD.  
1
𝑆𝑘+1 − 𝑆𝑘 = 𝑙
∥𝑤𝑘+1 − 𝑤 ∗ ∥ 22 − ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ 22
Theorem 1. Given a cost function satisfying Assumptions 1 - 2
3, and a learning rate of 𝛼𝑘 = (𝑘+𝜇𝑇
𝛼
)2
(0 < 𝜇 < 1), the CGD 1
 𝑚
∑︁ 
𝑗
optimization gives the regret R = ∥𝑤𝑘𝑙 − 𝛼𝑘 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − 𝑤 ∗ ∥ 22 − ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ 22
2 𝑗=1
1 ln |𝜇𝑇 + 1| 1
𝑚 𝑚
R = 𝑂 (𝜖 + + )
∑︁ ∑︁
(10) 𝑗 𝑗
𝜇𝑇 + 1 𝑇 = ∥𝛼𝑘 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 − 𝛼𝑘 ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )⟩
2 𝑗=1 𝑗=1
where 𝜖 = 𝑚∥ E (𝑤 1𝑙 − 𝑤 1 )∥.
𝑗 𝑚
1 ∑︁
𝑗
𝑗 ∈𝑚 = ∥𝛼 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22
2 𝑘 𝑗=1 (12)
The theorem implies the following two remarks. 𝑚
∑︁
𝑗
• Convergence rate. Both and1 ln |𝜇𝑇 +1|
approach 0 as − 𝛼𝑘 ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) + ∇𝐹 (𝑤𝑘𝑙 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩
𝜇𝑇 +1 𝑇
𝑗=1
𝑇 increases, implying that CGD will converge toward the
𝑚
optimum. The convergence rate can be adjusted by the pa- 1 ∑︁
𝑗
= ∥𝛼𝑘 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 − 𝛼𝑘 ⟨𝑤𝑘𝑙 − 𝑤 ∗, ∇𝐹 (𝑤𝑘𝑙 )⟩
rameter 𝜇 (the effect of 𝜇 is investigated in Section 7.2.2). 2 𝑗=1
• Bounded optimality gap. When CGD converges, the gap 𝑚
∑︁
between the confined models and the centralized model is − 𝛼𝑘 ⟨𝑤𝑘𝑙 − 𝑤 ∗,
𝑗
𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩
bounded by 𝜖. CGD uses the initialization parameter 𝛿 𝑙 to 𝑗=1
determine the range of 𝑤 1𝑙 in the way of
Dividing the above equation by 𝛼𝑘 , we can prove the lemma. □
𝑤 1𝑙 = 𝛿 · Rand
𝑙 𝑙
for all 𝑙 ∈ (1, 𝑚), (11)
In the following, we give the proof of Theorem 1. It calculates
where Rand is the initialization scheme. A standard Rand a function that is greater than R based on the convexity of the
used in machine learning is to apply random sampling from objective function (Inequation 13). The function can be decomposed
the Gaussian distribution of mean 0 and the variance 1, into three terms based on Lemma 1 (Equation 14). We then explore
and a standard 𝛿 is √1 (where 𝑛 is the sample size) [13]. the boundedness of each term, and taking these bounds together
𝑛 concludes the proof.
In CGD, each participant determines its own Rand𝑙 and 𝛿 𝑙
independently to avoid leaking the average distance among Proof. By the definition of the regret function (Equation 6) and
the confined models. Our experiment finds that CGD keeps Equation 9 in Assumption 2, we have
robust (in terms of validation accuracy) even when the par-
𝑇 𝑇
ticipants select their 𝛿 𝑙 s uniformly at random in a range of 1 ∑︁ 1 ∑︁ 𝑙
R= (𝐹 (𝑤𝑘𝑙 ) − 𝐹 (𝑤 ∗ )) ≤ ⟨𝑤𝑘 − 𝑤 ∗, ∇𝐹 (𝑤𝑘𝑙 )⟩ (13)
( √10 , √
0.1 ) (cf. Section 7.2.1).
𝑇 𝑇
𝑛 𝑛 𝑘=1 𝑘=1
Applying Lemma 1 to Inequation 13 and multiplying it by 𝑇 , we Determining the bound of the third term is slightly complex. We
have list it as the following claim, and prove it soon after the proof of
𝑇  𝑚
Theorem 1.
∑︁ 1 ∑︁
𝑗
𝑇 ·𝑅 ≤ 𝛼𝑘 ∥ 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22
2 Claim 1.
𝑘=1 𝑗=1
𝑚  𝑇 𝑚
1 ∑︁
𝑗
∑︁ ∑︁
𝑗
− (𝑆𝑘+1 − 𝑆𝑘 ) − ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩ − ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩
𝛼𝑘 𝑗=1 𝑗=1
𝑘=1
𝑚
(22)
𝑇 𝑚 𝑇 ∑︁ 𝑇
∑︁ 1 ∑︁ ∑︁ 1 𝑗 2
≤ 𝛼𝑘 ∥
𝑗
𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 − (𝑆𝑘+1 − 𝑆𝑘 ) < 𝐷𝐿∥𝑇 (𝑤 1𝑙 − 𝑤 1 )∥ + 2𝑚 𝐺𝐷𝐿( + ln |𝜇𝑇 + 1|)
2 𝛼 𝑘 𝑗=1
𝜇𝑇 + 1
𝑘=1 𝑗=1 𝑘=1
𝑇 𝑚
∑︁ ∑︁
𝑗 Combining Inequations 14, 16, 21 and Claim 1, and dividing by
− ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩
T we obtain
𝑘=1 𝑗=1
(14) 𝑚
2𝛼𝑚 2𝐺 2 ∑︁
𝑗 1 ln |𝜇𝑇 + 1|
Inequation 14 can be decomposed into three terms. The first two R< + 𝐷𝐿∥ (𝑤 1𝑙 − 𝑤 1 )∥ + 2𝑚 2𝐺𝐷𝐿( + )
𝑚 𝑇 𝑗=1
𝜇𝑇 + 1 𝑇
𝑗
terms 𝑇𝑘=1 12 𝛼𝑘 ∥ 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 and − 𝑇𝑘=1 𝛼1𝑘 (𝑆𝑘+1 − 𝑆𝑘 ) sums
Í Í Í
𝑗=1 1 ln |𝜇𝑇 + 1|
= 𝑂 (𝜖 + + ), concluding the proof.
up the model updates throughout the training iterations. The third 𝜇𝑇 + 1 𝑇
Í 𝑚
Í 𝑗
term − 𝑇𝑘=1 ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩ measures the gap □
𝑗=1
of the gradients between CGD and the centralized training.
Next, we explore the boundedness of each term. For the first Proof of Claim 1.
term, we have
𝑇 𝑚 𝑇 Proof.
∑︁ 1 ∑︁
𝑗
∑︁ 1 𝛼
𝛼𝑘 ∥ 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 ≤ 𝑚 2𝐺 2 (15)
2 𝑗=1
2 (𝑘 + 𝜇𝑇 ) 2 𝑇
∑︁ 𝑚
∑︁
𝑗
𝑘=1 𝑘=1 − ⟨𝑤𝑘𝑙 − 𝑤 ∗, 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) − ∇𝐹 (𝑤𝑘𝑙 )⟩
𝑇
𝛼𝑚 2𝐺 2 ∑︁ 1 𝑘=1 𝑗=1
= < 𝛼𝑚 2𝐺 2 .
2 (𝑘 + 𝜇𝑇 ) 2
𝑘=1
(16) 𝑇
∑︁ 𝑚
∑︁
𝑗 
= ⟨𝑤𝑘𝑙 − 𝑤 ∗, ∇𝐹 (𝑤𝑘𝑙 ) − 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) ⟩ (23)
Inequation 15 is based on Assumption 3(b), and Inequation 16 is 𝑘=1 𝑗=1
Í∞ 1
based on the solution to the Basel problem that 𝑥=1 𝑥2
< 2. 𝑇 𝑚
∑︁ ∑︁
𝑗 
For the second term, we have ≤ ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ · ∥ ∇𝐹 (𝑤𝑘𝑙 ) − 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ) ∥ (24)
𝑇 𝑇 𝑘=1 𝑗=1
∑︁ 1 ∑︁ 1
− (𝑆 − 𝑆𝑘 ) = (𝑆 − 𝑆𝑘+1 ) (17) 𝑇 ∑︁
𝑚
𝛼𝑘 𝑘+1 𝛼𝑘 𝑘
∑︁
𝑗
𝑘=1 𝑘=1 ≤ ∥𝑤𝑘𝑙 − 𝑤 ∗ ∥ · 𝐿∥ (𝑤𝑘𝑙 − 𝑤𝑘 )∥ (25)
𝑇   𝑘=1 𝑗=1
∑︁ 1 1 𝑙 1 𝑙
= ∥𝑤 − 𝑤 ∗ ∥ 22 − ∥𝑤𝑘+1 − 𝑤 ∗ ∥ 22 𝑚
∑︁ 𝑚
∑︁
𝛼𝑘 2 𝑘 2 ≤ 𝐷𝐿∥
𝑗
(𝑤 1𝑙 − 𝑤 1 ) + ... + (𝑤𝑇𝑙 − 𝑤𝑇 )∥
𝑗
(26)
𝑘=1
(18) 𝑗=1 𝑗=1
𝑇
∑︁ 1
≤ ∥(𝑤𝑘𝑙 − 𝑤 ∗ ) − (𝑤𝑘+1
𝑙
− 𝑤 ∗ )∥ 22 𝑚
∑︁
𝑗
𝑘=1
2𝛼𝑘 = 𝐷𝐿∥𝑇 (𝑤 1𝑙 − 𝑤 1 )
(19) 𝑗=1
𝑇 𝑇 𝑚 −1 ∑︁
𝑇∑︁ 𝑚 𝑚
∑︁
∑︁ 1 ∑︁ 1 ∑︁ 𝑗 
= ∥𝑤𝑘𝑙 − 𝑤𝑘+1
𝑙
∥ 22 = ∥𝛼𝑘
𝑗
𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 − 𝛼 (𝑇 −𝑘) 𝑘 ∇𝐹 (𝑤 𝑙(𝑇 −𝑘) ) − 𝑔 𝑗 (𝑤 (𝑇 −𝑘) , 𝜉 𝑗 ) ∥
2𝛼𝑘 2𝛼𝑘 𝑗=1 𝑘=1 𝑗=1 𝑗=1
𝑘=1 𝑘=1
(20) (27)
𝑇 𝑚
 𝑚
∑︁
𝑗
1 ≤ 𝐷𝐿 ∥𝑇 (𝑤 1𝑙 − 𝑤 1 )∥
∑︁ ∑︁
𝑗
= 𝛼𝑘 ∥ 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )∥ 22 < 𝛼𝑚 2𝐺 2
2 𝑗=1 𝑗=1
𝑘=1
(21) −1 ∑︁
𝑇∑︁ 𝑚 𝑚
∑︁ 
𝑗 
+∥ 𝛼 (𝑇 −𝑘) 𝑘 ∇𝐹 (𝑤 𝑙(𝑇 −𝑘) ) − 𝑔 𝑗 (𝑤 (𝑇 −𝑘) , 𝜉 𝑗 ) ∥
Inequation 19 follows reverse triangle inequality, and Inequation 𝑘=1 𝑗=1 𝑗=1
21 reuses the result of the first term (cf. Equation 16). (28)
𝑚 −1
𝑇∑︁
∑︁
𝑗 uses additive sharing over Z232 for securely evaluating addition
≤ 𝐷𝐿∥𝑇 (𝑤 1𝑙 − 𝑤 1 )∥ + 𝐷𝐿 · 2𝑚 2𝐺 𝛼 (𝑇 −𝑘) 𝑘 (29) operations in a multiparty computation environment. It guarantees
𝑗=1 𝑘=1
the secrecy of the addends even though the majority (𝑚 − 2 out of
𝑚 −1
𝑇∑︁
∑︁
𝑗 𝑘 𝑚) of participants are compromised. An brief introduction of the
= 𝐷𝐿∥𝑇 (𝑤 1𝑙 − 𝑤 1 )∥ + 2𝑚 2𝐺𝐷𝐿 (30) additive secret sharing scheme can be found in Appendix A.
𝑗=1
((1 + 𝜇)𝑇 − 𝑘) 2
𝑘=1
In the rest of this section, we explore the privacy preservation of
𝑚
∑︁
𝑗 𝑇 CGD. We demonstrate CGD’s privacy enhancement over traditional
< 𝐷𝐿∥𝑇 (𝑤 1𝑙 − 𝑤 1 )∥ + 2𝑚 2𝐺𝐷𝐿( + ln |𝜇𝑇 + 1|) (31)
𝑗=1
𝜇𝑇 + 1 FL. We prove that less information is exposed in CGD compared to
that of traditional FL.
Inequations 24 and 28 are from triangle inequality. Inequation
𝑛 𝑚
25 is from the fact ∇𝐹 (𝑤𝑘𝑙 ) = 𝑛1
Í
∇𝐹 (𝑤𝑘𝑙 , 𝜉𝑖 ) =
Í
𝑔 𝑗 (𝑤𝑘𝑙 , 𝜉 𝑗 ) 5.1 Information exposed in CGD
𝑖=1 𝑗=1
Recall that the involved parties are a set L of 𝑚 participants denoted
and Assumption 1’s blockwise Lipschitz-continuity. Inequation
𝑇 with logical identities 𝑙 ∈ [1, 𝑚], and 𝑡 is the adversarial threshold
Í
26 is from Assumption 3(a) and represents (·) by a summand (𝑡 ≤ 𝑚 − 2) (Section 3.1.2). Let 𝑐 be any subset of L that includes
𝑘=1 the compromised and colluding parties.
sequence. Equation 27 comes from the fact
We demonstrate that, during optimization in CGD, given only
𝑗 the sum of the local gradients which are computed on different
𝑤𝑘𝑙 − 𝑤𝑘 = (𝑤 1𝑙 − 𝛼 1 ∇𝐹 (𝑤 1𝑙 ) − ... − 𝛼𝑘 ∇𝐹 (𝑤𝑘𝑙 ))
𝑚 𝑚
confined models, the adversary can learn no information other
∑︁ ∑︁
𝑗
− (𝑤 1 − 𝛼 1
𝑗
𝑔 𝑗 (𝑤 1 , 𝜉 𝑗 ) − ... − 𝛼𝑘
𝑗
𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )) than their own inputs and the sum of the local gradients from other
Í
𝑗=1 𝑗=1 honest parties, i.e., 𝑔𝑘𝑙 (𝑤𝑘𝑙 , 𝜉𝑙 ) |𝑙 ∈L\𝑐 .
𝑚
∑︁ Our analysis is based on the simulation paradigm [14]. It com-
𝑗 𝑗
= (𝑤 1𝑙 − 𝑤 1 ) − 𝛼 1 (∇𝐹 (𝑤 1𝑙 ) − 𝑔 𝑗 (𝑤 1 , 𝜉 𝑗 )) pares what an adversary can do in a real protocol execution to what
𝑗=1 it can do in an ideal scenario, which is secure by definition. The
𝑚
∑︁ adversary in the ideal scenario, is called the simulator. An indistin-
𝑗
− ... − 𝛼𝑘 (∇𝐹 (𝑤𝑘𝑙 ) − 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 )). guishability between adversary’s view in real and ideal scenarios
𝑗=1 guarantees that it can learn nothing more than their own inputs
Inequation 29 follows Assumption 3(b) from which we obtain and the information required by the simulator for the simulation. A
𝑚
Í 𝑗 brief introduction of the simulation paradigm is given in Appendix
∥∇𝐹 (𝑤 𝑙(𝑇 −𝑘) )− 𝑔 𝑗 (𝑤 (𝑇 −𝑘) , 𝜉 𝑗 )∥ ≤ 2𝑚𝐺. Equation 30 is obtained B.
𝑗=1
by applying 𝛼 (𝑇 −𝑘) = (𝑇 −𝑘+𝜇𝑇𝛼 . Inequation 31 is from the follow- To facilitate the understanding on our analysis, we first present
)2 ′
ing fact. the used notations. Denote 𝑔𝑘L = {𝑔𝑘𝑙 (𝑤𝑘𝑙 , 𝜉𝑙 )}𝑙 ∈L′ as the local
Since gradients of any subset of participants L ′ ⊆ L at 𝑘 𝑡ℎ iteration.
∫ 𝑏
𝑘 𝑏 𝑎 Let 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 (𝑔𝑘L , 𝑡, P, 𝑐) denote their combined views from the
𝑑𝑘 = + ln |𝑐 − 𝑏 | − − ln |𝑐 − 𝑎|, execution of a real protocol P. Let 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 (𝑔𝑘𝑐 , 𝑧, 𝑡, F𝑝 , 𝑐) denote
𝑎 (𝑐 − 𝑘)
2 𝑐 − 𝑏 𝑐 − 𝑎
the views of 𝑐 from an ideal execution that securely computes a
we have function F𝑝 , where 𝑧 is the information required by the simulator
∫ 𝑇 −1
𝑘 𝑇 −1 S in the ideal execution for simulation.
𝑑𝑘 =
1 ((1 + 𝜇)𝑇 − 𝑘) 2 (1 + 𝜇)𝑇 − (𝑇 − 1) The following theorem shows that when executing CGD with the
1 threshold 𝑡, the joint view of the participants in 𝑐 can be simulated
+ ln |(1 + 𝜇)𝑇 − (𝑇 − 1)| − − ln |(1 + 𝜇)𝑇 − 1|
(1 + 𝜇)𝑇 − 1 by their own inputs and the sum of the local gradients from the
Í Í
𝑇 −1 remaining honest nodes, i.e., 𝑔𝑘𝑙 |𝑙 ∈L\𝑐 . Therefore, 𝑔𝑘𝑙 |𝑙 ∈L\𝑐
< + ln |(1 + 𝜇)𝑇 − (𝑇 − 1)| (with 𝑇 ≥ 2) is the only information that the adversary can learn during the
(1 + 𝜇)𝑇 − (𝑇 − 1)
𝑇 execution.
< + ln |𝜇𝑇 + 1|
𝜇𝑇 + 1 Theorem 2. When executing CGD with the threshold 𝑡, there
□ exists a simulator S such that for L and 𝑐, with 𝑐 ⊆ L and |𝑐 |≤ 𝑡,
the output of S from 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 is perfectly indistinguishable from
5 PRIVACY PRESERVATION the output of 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 , namely
The participants in CGD have to share the sum of local gradients, 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 (𝑔𝑘L , 𝑡, P, 𝑐) ≡ 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 (𝑔𝑘𝑐 , 𝑧, 𝑡, F𝑝 , 𝑐)
𝑚
Í 𝑗
i.e., 𝑔 𝑗 (𝑤𝑘 , 𝜉 𝑗 ). A straightforward way is to let each participant where
𝑗=1 ∑︁
𝑧= 𝑔𝑘𝑙 (𝑤𝑘𝑙 , 𝜉𝑙 ) |𝑙 ∈L\𝑐 .
release its local gradient 𝑔𝑙 (𝑤𝑘𝑙 , 𝜉𝑙 ),
but it may leak information
about 𝜉𝑙 or 𝑤𝑘𝑙 [47]. To address this, we incorporate the secure Proof. We define S through each training iteration as:
addition operation [3, 4, 29, 45] on the local gradients to calculate SIM1 : 𝑆𝐼𝑀 1 is the simulator for the first training iteration.
their sum without releasing each of them. We make use of the ad- Since the inputs of the parties in 𝑐 do not depend on the inputs of
ditive secret sharing scheme proposed by Bogdanov et al. [3], which the honest parties in L\𝑐, 𝑆𝐼𝑀 1 can produce a perfect simulation by
running 𝑐 on their true inputs, and L \ 𝑐 on a set of pseudorandom According to the chain rule in calculus, ∇𝐹 (𝑤𝑘 , 𝑥𝑙 , 𝑦𝑙 ) is com-
L\𝑐 𝜕𝐹 (ℎ (𝑥 𝑤 ),𝑦 ) 𝜕ℎ (𝑥 𝑤 ) 𝜕 (𝑥𝑙 𝑤𝑘 )
vectors 𝜂 1 = {𝜂𝑙1 }𝑙 ∈ L\𝑐 in a way that puted as 𝜕ℎ (𝑥𝑙 𝑤𝑘 ) 𝑙 𝜕 (𝑥 𝑙𝑤 𝑘) 𝜕𝑤 𝑘
, where 𝑥𝑙 𝑤𝑘 is matrix mul-
𝑙 𝑘 𝑙 𝑘
∑︁
L\𝑐
∑︁ ∑︁ ∑︁ ∑︁ tiplication of training samples 𝑥𝑙 and 𝑤𝑘 , and ℎ is the hypothesis
𝜂1 = 𝜂𝑙1 |𝑙 ∈ L\𝑐 = 𝑔1L − 𝑔𝑐1 = 𝑔𝑙1 |𝑙 ∈L\𝑐 . function which is determined by the learning model. For example,
in logistic regression, ℎ is usually a sigmoid function, while in neu-
Since each 𝑔𝑙1 (𝑤 1𝑙 , 𝜉𝑙 ) is computed from its respective confined
ral network, ℎ is a composite function that is known as forward
model 𝑤 1𝑙 which is randomized in the initialization, the pseudoran- 𝜕 (𝑥𝑙 𝑤𝑘 )
propagation. Let Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) = 𝜕𝐹 𝜕ℎ
𝜕ℎ 𝜕 (𝑥 𝑤 ) , and 𝜕𝑤𝑘 equals to
𝑚\𝑐 𝑙 𝑘
dom vectors 𝜂 1 generated by 𝑆𝐼𝑀 1 for the inputs of all parties
𝑥𝑙𝑇 . Then, Equation 33 can be written as
in L \ 𝑐, and the joint view of 𝑐 in 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 , will be identical to
that in 𝑉 𝐼 𝐸𝑊𝑟𝑒𝑎𝑙 , namely 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) = 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ), (35)
L\𝑐
∑︁ ∑︁ ∑︁
( 𝜂1 + 𝑔𝑐1 ) ≡ 𝑔1L , Plain federated SGD. In plain federated SGD, 𝑤𝑘 is updated as
Í the following (by combing Equation 1 and 35)
and the information required by 𝑆𝐼𝑀 1 is 𝑧 = 𝑔𝑙1 |𝑙 ∈ L\𝑐 .
𝑤𝑘+1 ← 𝑤𝑘 − 𝛼𝑘 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ), (36)
SIM𝑘+1 (𝑘 ≥ 1): 𝑆𝐼𝑀 𝑘+1 is the simulator for the (𝑘 +1)𝑡ℎ training
iteration.
Í L in which the local gradient 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) is shared among the
In 𝑅𝑒𝑎𝑙 execution, 𝑔𝑘+1 is computed as participants. As such, by knowing both 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) and 𝑤𝑘 , the
∑︁
L
∑︁ ∑︁ ∑︁ adversary is able to derive indicative information about (𝑥𝑙 , 𝑦𝑙 ). For
𝑔𝑘+1 = 𝑙
𝑔𝑘+1 𝑙
(𝑤𝑘+1 , 𝜉 𝑙 ) |𝑙 ∈ L = 𝑙
𝑔𝑘+1 (𝑤𝑘𝑙 − 𝛼𝑘 𝑔𝑘L , 𝜉𝑙 ) |𝑙 ∈L
example, in linear regression, since 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) = 𝑥𝑙𝑇 (𝑥𝑙 𝑤𝑘 −𝑦𝑙 ),
𝑘
∑︁ ∑︁ ∑︁ the adversary is able to obtain {𝑥𝑙𝑇 𝑥𝑙 , 𝑥𝑙𝑇 𝑦𝑙 }.
𝑔𝑖L ), 𝜉𝑙 |𝑙 ∈ L .
𝑙 
= 𝑔𝑘+1 𝑤 1𝑙 − (𝛼𝑖
Secure aggregated federated SGD. In this category of traditional
𝑖=1
(32) FL [4, 30, 31, 34, 40, 47], the local gradients is protected by secure
In 𝐼𝑑𝑒𝑎𝑙 execution, since each 𝑔𝑘+1𝑙 (𝑤 𝑙 , 𝜉 ) | is also com- aggregation, and the global model 𝑤𝑘 is updated as
𝑘+1 𝑙 𝑙 ∈ L ∑︁
puted from randomized 𝑤 1𝑙 , 𝑆𝐼𝑀 𝑘+1 can produce a perfect simula- 𝑤𝑘+1 ← 𝑤𝑘 − 𝛼𝑘 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈D , (37)
tion by running the parties L \ 𝑐 on a set of pseudorandom vectors 𝑙
L\𝑐 𝑙 }
𝜂𝑘+1 = {𝜂𝑘+1 𝑙 ∈ L\𝑐 in a way that where D ⊆ L. By combing Equation 37 and 35, we have
∑︁
L\𝑐
∑︁ ∑︁ ∑︁ ∑︁ ∑︁
𝜂𝑘+1 = 𝑙
𝜂𝑘+1 |𝑙 ∈ L\𝑐 = L
𝑔𝑘+1 𝑐
− 𝑔𝑘+1 = 𝑙
𝑔𝑘+1 |𝑙 ∈L\𝑐 . 𝑤𝑘+1 ← 𝑤𝑘 − 𝛼𝑘 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) |𝑙 ∈D , (38)
𝑙
As such, the joint view of 𝑐 in 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 , will be identical to
in which the aggregated gradient, 𝑥𝑙 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ), is shared among
Í 𝑇
that in 𝑉 𝐼 𝐸𝑊𝑟𝑒𝑎𝑙
𝑙
∑︁
L\𝑐
∑︁
𝑐
∑︁
L the participants.
( 𝜂𝑘+1 + 𝑔𝑘+1 )≡ 𝑔𝑘+1 ,
As the global model 𝑤𝑘 is also shared, by observing the changes
Í 𝑙
and the information required by 𝑆𝐼𝑀 𝑘+1 is 𝑧 = 𝑔𝑘+1 |𝑙 ∈L\𝑐 . of the aggregated gradient during training iterations, i.e., 𝑤𝑘 −𝑤𝑘+1 ,
By summarizing 𝑆𝐼𝑀 1 and 𝑆𝐼𝑀 𝑘+1 , the output of the simulator the adversary is still able to obtain indicative information about
𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 of each training iteration is perfectly indistinguishable (𝑥𝑙 , 𝑦𝑙 ).
′ ′
from the output of 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 , and knowledge of 𝑧 is sufficient for Let 𝑥 L , 𝑦 L respectively denote the concatenated matrix of
the simulation, completing the proof. training samples {𝑥𝑙 }𝑙 ∈L′ , and labels {𝑦𝑙 }𝑙 ∈L′ of any subset of
□ participants L ′ ⊆ L. The following theorem shows that when
executing secure aggregated federated SGD with the threshold 𝑡,
the joint view of the participants in 𝑐 can be simulated by (1) the sum
5.2 Information exposed in traditional FL
of the local gradients from the remaining honest nodes in D, that
In this section, we demonstrate information exposed in traditional Í
is, 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈D\𝑐 (2) and indicative information about (𝑥𝑙 , 𝑦𝑙 )
FL, including plain federated SGD, secure aggregated federated 𝑇
SGD, and differentially private federated SGD. in D \ 𝑐, that is, 𝑥 D\𝑐 Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 ). For example, in linear
Let regression, as 𝑥 D\𝑐 𝑇 𝑇
Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 ) = 𝑥 D\𝑐 (𝑥 D\𝑐 𝑤𝑘 −𝑦 D\𝑐 ),
𝑇 𝑇
1 1 the adversary is able to simulate {𝑥 D\𝑐 𝑥 D\𝑐 , 𝑥 D\𝑐 𝑦 D\𝑐 }.
𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) = ∇𝐹 (𝑤𝑘 , 𝜉𝑙 ) = ∇𝐹 (𝑤𝑘 , 𝑥𝑙 , 𝑦𝑙 ), (33)
|𝜉𝑙 | |𝑥𝑙 |
Theorem 3. When executing secure aggregated federated SGD
be the local gradient with respect to training dataset 𝜉𝑙 = (𝑥𝑙 , 𝑦𝑙 ). with the threshold 𝑡, there exists a simulator S such that for L, D and
In traditional FL, 𝑤𝑘 is the public global model shared among the 𝑐, with D ⊆ L, 𝑐 ⊆ L and |𝑐 |≤ 𝑡, the output of S from 𝑉 𝐼 𝐸𝑊𝑖𝑑𝑒𝑎𝑙
participants. is perfectly indistinguishable from the output of 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 , namely
For the sake of simplicity, we assume SGD is not generalized to ∑︁
mini-batch update, i.e., we have |𝑥1 | = 1. Then, Equation 33 can be 𝑉 𝐼𝐸𝑊𝑟𝑒𝑎𝑙 ( 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈D , 𝑡, P, 𝑐)
𝑙
𝑙
written as,
𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) = ∇𝐹 (𝑤𝑘 , 𝑥𝑙 , 𝑦𝑙 ), (34) ≡ 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 (𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈𝑐 , 𝑧 1, 𝑧 2, 𝑡, F𝑝 , 𝑐)
where random variation among the proprietary global models of the par-
∑︁
D\𝑐 𝑇 D\𝑐 D\𝑐 ticipants. The variation hides each global model from other partici-
𝑧1 = 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D\𝑐 , 𝑧 2 =𝑥 Δ(𝑥 𝑤𝑘 , 𝑦 )
pants, such that the privacy is enhanced in general.
In Table 2, we summarize the privacy enhancement from the
Proof. SIM𝑘 (𝑘 ≥ 1): 𝑆𝐼𝑀 𝑘 is the simulator for the 𝑘 𝑡ℎ training perspective of adversary’s observation, i.e, the exposed information
iteration. to the adversary during the optimization. In traditional FL, the shar-
Í
In 𝑅𝑒𝑎𝑙 execution, 𝑔𝑘 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D is computed as (with Equa- ing of global model 𝑤𝑘 , even if the local gradients are protected, is
tion 37 and 38) still subject to information leakage about the original dataset. By
∑︁ ∑︁ 1 eliminating the sharing of 𝑤𝑘 and letting each participant confine
𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D = 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) |𝑙 ∈ D = (𝑤𝑘 − 𝑤𝑘+1 )
𝑙
𝛼 its own 𝑤𝑘𝑙 , CGD achieves boosted privacy over traditional privacy-
(39) preserving FL. (1) Compared to secure aggregated FL in which
In 𝐼𝑑𝑒𝑎𝑙 execution, since 𝑤𝑘 is shared among the participants, indicative information about original datasets can be observed via
by computing 𝑤𝑘 − 𝑤𝑘+1 , 𝑆𝐼𝑀 𝑘 can produce a perfect simulation the sharing of 𝑤𝑘 (Theorem 3), the variation among 𝑤𝑘𝑙 in CGD
by running the parties D \ 𝑐 on prevents such information from disclosure, and guarantees that
∑︁ 𝑇 only the sum of local gradients is exposed during the optimization
𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D\𝑐 , or, 𝑥 D\𝑐 Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 ), (Theorem 2). (2) Compared to differential privacy mechanism in
As such, the joint view of 𝑐 in 𝑉 𝐼𝐸𝑊𝑖𝑑𝑒𝑎𝑙 , will be identical to which privacy decays with the increasing training epochs, the vari-
that in 𝑉 𝐼 𝐸𝑊𝑟𝑒𝑎𝑙 , since ation introduced to each 𝑤𝑘𝑙 hides the local gradients throughout
∑︁ ∑︁ ∑︁ the whole training process, and thus retains privacy regardless of
𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D\𝑐 + 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈𝑐 ≡ 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈D the number of training epochs.
and,
𝑇
𝑥 D\𝑐 Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 ) + 𝑥 𝑐𝑇 Δ(𝑥 𝑐 𝑤𝑘 , 𝑦𝑐 ) 6 CASE STUDY: CONFINED GRADIENT
𝑇
DESCENT FOR A N-LAYER NEURAL
≡ 𝑥 D Δ(𝑥 D 𝑤𝑘 , 𝑦 D ) NETWORK
∑︁
≡ 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) |𝑙 ∈ D CGD can be applicable to any machine learning algorithms that
𝑙 use gradient descent for optimization. In this section, we apply it to
Í 𝑙 a N-layer neural network to demonstrate its usability. We assume
Thus the information required by 𝑆𝐼𝑀 𝑘 is 𝑧 1 = 𝑔𝑘 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈D\𝑐
𝑇 that a centralized dataset 𝜉 is horizontally and vertically partitioned
and 𝑧 2 = 𝑥 D\𝑐 Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 ). and distributed to 𝑚 participants where 𝑚 = (𝑚ℎ × 𝑚 𝑣 ), i.e., the
𝑇
Together, we have { 𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) |𝑙 ∈ D\𝑐 , 𝑥 D\𝑐 Δ(𝑥 D\𝑐 𝑤𝑘 , 𝑦 D\𝑐 )}
Í
number of horizontal partitions multiplied by the number of vertical
being the information that the adversary can learn during the exe- partitions. The participant 𝑙 owns a private part of the training
cution. □ dataset, denoted by 𝜉𝑙 (𝑖,𝑗 ) (𝑖 ∈ [1, 𝑚ℎ ], 𝑗 ∈ [1, 𝑚 𝑣 ]), as well as its
(1)𝑙 (𝑁 )𝑙
Differentially private federated SGD. In most differentially pri- confined model parameters, denoted by 𝑤𝑘 (𝑖,𝑗 ) ,...,𝑤𝑘 (𝑖,𝑗 ) . Since
vate federated SGD, the local gradients are protected by additive each participant owns different confined models and proportion of
noise mechanism as the dataset, the training prediction 𝑦b𝑙 (𝑖,𝑗 ) is also different and kept
confined in its owner (shown in Figure 3).
𝑤𝑘+1 ← 𝑤𝑘 − 𝛼𝑘 (𝑔𝑘𝑙 (𝑤𝑘 , 𝜉𝑙 ) + N𝑘 ), (40) Algorithm 2 presents the detailed algorithm. The participant 𝑙
(1)𝑙 (𝑖,𝑗 ) (𝑁 )𝑙 (𝑖,𝑗 )
where N𝑘 denote the noise added at the 𝑘 𝑡ℎ iteration. By combing first randomly initializes its confined model 𝑤 1 ,...,𝑤 1
𝑑 ×𝐻
Equation 35, it can be written as (line 3). The size of the model in the first layer is 𝑤 (1)𝑙 (𝑖,𝑗 ) ∈ R 𝑙 𝑗 1 ,
𝑤𝑘+1 ← 𝑤𝑘 − 𝛼𝑘 (𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) + N𝑘 ), (41) where 𝑑𝑙 𝑗 is the number of features in 𝜉𝑙 (𝑖,𝑗 ) , and the size of models
in the remaining layers is 𝑤 (𝑟 )𝑙 (𝑖,𝑗 ) ∈ R𝐻𝑟 −1 ×𝐻𝑟 (𝑟 ∈ [2, 𝑁 ]). It is
The information exposed among participants is 𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) + possible that different participants have different size of 𝑤 (1)𝑙 (𝑖,𝑗 ) be-
N𝑘 , and the additive noise N𝑘 prevent one from directly deriving cause the number of features 𝑑𝑙 𝑗 held by each participant may differ,
𝑥𝑙𝑇 Δ(𝑥𝑙 𝑤𝑘 , 𝑦𝑙 ) by subtracting 𝑤𝑘 and 𝑤𝑘+1 . However, with repeated
while 𝑤 (𝑟 )𝑙 (𝑖,𝑗 ) (𝑟 ∈ [2, 𝑁 ]) keep the same size in each participant.
access to the datasets during training epochs, 𝜖 (the parameter of
Next, we detail the training process in each iteration. In the for-
privacy loss) accumulates, i.e., privacy degrades, as the effect of
ward propagation, each participant separately computes the output
added noise being canceled out [2, 8]. (1)𝑙 (𝑁 )𝑙
of each layer 𝑎𝑘 (𝑖,𝑗 ) , ..., 𝑎𝑘 (𝑖,𝑗 ) based on its own confined models
5.3 Enhanced privacy over traditional FL (line 5 to 11). In the backward propagation, each participant solely
(𝑁 )𝑙 (1)𝑙
Traditional FL requires all participants to update the same global computes the local gradient of each layer 𝑔𝑘 (𝑖,𝑗 ) , ...𝑔𝑘 (𝑖,𝑗 ) which
model during the training process. Every participant thus sees the is computed from its own private dataset and confined model (line
identical intermediate results, as the same aggregated gradients 12 to 18). Then, they securely evaluate the sum of local gradients
are shared. This is the root cause of most privacy threats against from the 𝑁 𝑡ℎ layer to the 2𝑛𝑑 layer (line 19 to 21). For the first
FL. CGD breaks the mode of single global model, by introducing layer gradient, since the size of 𝑤 (1)𝑙 (𝑖,𝑗 ) can be different, the sum
Table 2: A summary of the exposed information in traditional FL and CGD.

Plain federated SGD
  Adversary's observation (honest-but-curious, white-box) during training iterations: local gradients g_k^l(w_k, ξ_l), which equal x_l^T Δ(x_l w_k, y_l).
  Exposed indicative information about (x_l, y_l) (in the example of linear regression): {x_l^T x_l, x_l^T y_l}.
  Boosted privacy of CGD over traditional FL: in plain federated SGD, indicative information about the local training dataset can be observed.

Secure aggregated federated SGD
  Adversary's observation: sum of the local gradients from the remaining honest participants, Σ g_k^l(w_k, ξ_l) |_{l ∈ D\c}; with a shared w_k, it equals (x^{D\c})^T Δ(x^{D\c} w_k, y^{D\c}) (Theorem 3).
  Exposed indicative information: {(x^{D\c})^T x^{D\c}, (x^{D\c})^T y^{D\c}}.
  Boosted privacy of CGD over traditional FL: with the shared global w_k, indicative information about the concatenated training datasets from honest participants can be observed.

Differentially private federated SGD (additive noise based mechanism)
  Adversary's observation: perturbed local gradients g_k^l(w_k, ξ_l) + N_k, which equal x_l^T Δ(x_l w_k, y_l) + N_k.
  Exposed indicative information: not applicable with a single access to (x_l, y_l). However, with repeated access to (x_l, y_l) during epochs, the adversary is able to get a more accurate estimate of x_l^T Δ(x_l w_k, y_l), which can be used to obtain {x_l^T x_l, x_l^T y_l}.
  Boosted privacy of CGD over traditional FL: the decay of privacy with increasing number of training epochs is one of the limitations of most additive-noise based differentially private learning.

CGD
  Adversary's observation: sum of the local gradients from the remaining honest participants, Σ g_k^l(w_k^l, ξ_l) |_{l ∈ L\c} (Theorem 2).
  Exposed indicative information: not applicable.
  Boosted privacy: in CGD, each local gradient is computed from a different and private w_k^l, and only the sum of local gradients is exposed during the training. As such, (1) the indicative information exposed in the secure aggregation cannot be derived in CGD; (2) the randomness introduced in each w_k^l hides the information about the local gradient throughout the training, and thus privacy does not decay.

DP-FL, and approach that of the centralized training. The settings


𝜉! (",$) 𝑤 (#)! (",$) …… 𝑤 (%)! (",$) ŷ! (",$) of all trainings are listed below.
• CGD training. The model is trained using Algorithm 1 on
the dataset which is partitioned and distributed to 𝑚 par-
Figure 3: Local data and confined model of a N-layer neural ticipants, where 𝑚 = (𝑚ℎ × 𝑚 𝑣 ). We observe that the per-
network in participant 𝑙 (𝑖,𝑗) (𝑖 ∈ 𝑚ℎ , 𝑗 ∈ 𝑚 𝑣 ). formance of all confined models are proximal, and thus we
report the worst performance in this section.
• Centralized training (the baseline). We take the perfor-
(1)𝑙
of the local gradients 𝑔𝑘 𝑗 is therefore taken among the vertically mance of centralized training as the baseline. In this setting,
partitioned participant groups 𝑙 𝑗 = {𝑙 (1,𝑗) , ..., 𝑙 (𝑚ℎ ,𝑗) } (line 22 to 24). the training is on the entire dataset using the batch update.
In the descent, after choosing a learning rate 𝛼𝑘 (line 25), the first • Local training. In this setting, the data and model are dis-
(1)𝑙 tributed to 𝑚 participants in the same way as in CGD. Each
layer is updated by taking a descent step of 𝛼𝑘 𝑔𝑘 𝑗 (line 26 to 28),
and the remaining layers are updated by taking a descent step of participant separately trains their local models from their
(𝑟 ) local data, without sharing the gradients.
𝛼𝑘 𝑔𝑘 (𝑟 ∈ [2, 𝑁 ]) (line 29 to 31).
• DP-FL. We take the scheme proposed by Abadi et al. [2],
one of the state-of-the-art differentially private FL schemes,
7 EXPERIMENTS in our experiments. Due to the different setting of network
We implement CGD and evaluate its performance of model accuracy. architecture and hyper-parameters, its centralized baseline is
It is implemented using C++, and we use Eigen library [15] to slightly different from ours. Therefore, we report the margin
handle matrix operations and ZeroMQ library [18] for distributed between its performance and the performance of its baseline.
messaging. Our evaluation focuses on the performance of CGD
in terms of validation loss and accuracy, and the influence of the 7.1.1 MNIST. Our first set of experiments are conducted on the
factors (the parameters 𝛿 and 𝜇) on its performance. standard MNIST dataset which is a benchmark for handwritten
digit recognition. It has 60, 000 training samples and 10, 000 test
7.1 Performance Evaluation samples, each with 784 features representing 28 × 28 pixels in the
image. We use a fully-connect neural network (FNN) with ReLU of
We evaluate validation loss and accuracy on two popular benchmark
256 units and softmax of 10 classes with cross-entropy loss.
datasets: MNIST [28] and CIFAR-10 [26]. We compare CGD with the
We conduct our experiments with CGD’s default settings:
training on aggregated data (referred to as the centralized training),
the training on each participant’s local data only (referred to as • 𝜇 = 0, leading to a fixed learning rate which is in line with
the local training), and FL with differential privacy (referred to as most machine learning algorithms,
the DP-FL). CGD is expected to outperform the local training and • 𝛿 = 0.1 (cf. Equation 11), and
Algorithm 2 Confined Gradient Descent of a N-layer fully-
connected neural network for 𝑚 = (𝑚ℎ × 𝑚 𝑣 ) participants 1.0
7 Centralized training
CGD training
1:  Input: Local training data $\xi_{l_{(i,j)}}$ ($i \in m_h$, $j \in m_v$), activation functions of $N$ layers: $\sigma^{(1)}, \ldots, \sigma^{(N)}$, cost function $J$, the number of training iterations $T$
2:  Output: Confined global model parameters of $N$ layers: $w_*^{(1)l_{(i,j)}}, \ldots, w_*^{(N)l_{(i,j)}}$ ($i \in m_h$, $j \in m_v$)
3:  Initialize: $k \leftarrow 1$; each participant $l_{(i,j)}$ randomizes its own $w_1^{(1)l_{(i,j)}}, \ldots, w_1^{(N)l_{(i,j)}}$
4:  while $k \le T$ do
        Forward propagation
5:      for all participants $l_{(i,j)}$ do in parallel
6:          $z_k^{(1)l_{(i,j)}} \leftarrow \xi_{l_{(i,j)}}\, w_k^{(1)l_{(i,j)}}$
7:          $a_k^{(1)l_{(i,j)}} \leftarrow \sigma^{(1)}(z_k^{(1)l_{(i,j)}})$
8:          $z_k^{(2)l_{(i,j)}} \leftarrow a_k^{(1)l_{(i,j)}}\, w_k^{(2)l_{(i,j)}}$
9:          ......
10:         $a_k^{(N)l_{(i,j)}} \leftarrow \sigma^{(N)}(z_k^{(N)l_{(i,j)}})$, $\hat{y}^{l_{(i,j)}} \leftarrow a_k^{(N)l_{(i,j)}}$
11:     end for
        Backward propagation
        Calculate the local gradients of each layer
12:     for all participants $l_{(i,j)}$ do in parallel
            $\delta_k^{(N)l_{(i,j)}} \leftarrow \frac{\partial J(\hat{y}^{l_{(i,j)}},\, y^{l_{(i,j)}})}{\partial a_k^{(N)l_{(i,j)}}} \frac{\partial a_k^{(N)l_{(i,j)}}}{\partial z_k^{(N)l_{(i,j)}}}$
13:         $g_k^{(N)l_{(i,j)}} \leftarrow \delta_k^{(N)l_{(i,j)}} \frac{\partial z_k^{(N)l_{(i,j)}}}{\partial w_k^{(N)l_{(i,j)}}}$
14:         for $r = N-1, \ldots, 1$ do
15:             $\delta_k^{(r)l_{(i,j)}} \leftarrow \delta_k^{(r+1)l_{(i,j)}} \frac{\partial z_k^{(r+1)l_{(i,j)}}}{\partial a_k^{(r)l_{(i,j)}}} \frac{\partial a_k^{(r)l_{(i,j)}}}{\partial z_k^{(r)l_{(i,j)}}}$
16:             $g_k^{(r)l_{(i,j)}} \leftarrow \delta_k^{(r)l_{(i,j)}} \frac{\partial z_k^{(r)l_{(i,j)}}}{\partial w_k^{(r)l_{(i,j)}}}$
17:         end for
18:     end for
        Securely evaluate the sum of local gradients
19:     for all participants $l_{(i,j)}$ do in parallel
20:         $g_k^{(r)} \leftarrow \sum_{i,j=1}^{m_h,\, m_v} g_k^{(r)l_{(i,j)}}$, $(r \in [2, N])$
21:     end for
22:     for all groups $l_j = \{l_{(1,j)}, \ldots, l_{(m_h,j)}\}$ do in parallel
23:         Evaluate the sum of the first layer: $g_k^{(1)l_j} \leftarrow \sum_{i=1}^{m_h} g_k^{(1)l_{(i,j)}}$
24:     end for
        Descent
25:     Choose a stepsize: $\alpha_k$
26:     for all groups $l_j = \{l_{(1,j)}, \ldots, l_{(m_h,j)}\}$ do in parallel
27:         $w_{k+1}^{(1)l_{(i,j)}} \leftarrow w_k^{(1)l_{(i,j)}} - \alpha_k\, g_k^{(1)l_j}$
28:     end for
29:     for all participants $l_{(i,j)}$ do in parallel
30:         $w_{k+1}^{(r)l_{(i,j)}} \leftarrow w_k^{(r)l_{(i,j)}} - \alpha_k\, g_k^{(r)}$, $(r \in [2, N])$
31:     end for
32:     $k \leftarrow k + 1$
33: end while
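To make the per-participant computation above concrete, the following is a minimal NumPy sketch of one iteration for a single participant of a two-layer network with a mean-squared-error cost. It is an illustration only: the activation, the cost, the variable names, and the secure_sum helper are assumptions, and the actual secure aggregation in CGD is realized with the additive secret sharing scheme of Appendix A rather than the placeholder used here.

```python
import numpy as np

def sigma(z):            # example activation (sigmoid); the paper leaves sigma^(r) generic
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

def local_gradients(xi, y, w1, w2):
    """Forward and backward pass of one participant on its local partition xi."""
    z1 = xi @ w1                        # first-layer pre-activation
    a1 = sigma(z1)
    z2 = a1 @ w2                        # second (output) layer
    y_hat = sigma(z2)
    # Backward propagation for a mean-squared-error cost J.
    delta2 = (y_hat - y) * sigma_prime(z2)
    g2 = a1.T @ delta2                  # local gradient of the output layer
    delta1 = (delta2 @ w2.T) * sigma_prime(z1)
    g1 = xi.T @ delta1                  # local gradient of the first layer
    return g1, g2

def cgd_step(xi, y, w1, w2, alpha, secure_sum, column_group, all_participants):
    """One CGD iteration for participant (i, j); `secure_sum` stands in for the
    additive-secret-sharing aggregation (Appendix A) over the indicated group."""
    g1, g2 = local_gradients(xi, y, w1, w2)
    # First-layer gradients are summed only within the participant's column group l_j;
    # gradients of the remaining layers are summed over all participants.
    g1_sum = secure_sum(g1, group=column_group)
    g2_sum = secure_sum(g2, group=all_participants)
    w1 = w1 - alpha * g1_sum
    w2 = w2 - alpha * g2_sum
    return w1, w2
```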
et al. [17]. It takes as input images of size 32 × 32, with the per-
pixel mean subtracted. Its first layer is 3 × 3 convolutions, and then
12
Table 3: Performance comparison on MNIST with default settings.

                      Validation loss              Validation accuracy
                      CGD       Local training     CGD       Local training
Centralized           0.081                        97.54%
m = (10 × 7)          0.143     1.283              96.5%     54.39%
m = (100 × 49)        0.189     2.076              94.21%    23.53%
m = (1000 × 112)      0.209     2.348              93.74%    11.64%

Table 4: Comparison between CGD and traditional FL via differential privacy in terms of the difference of validation accuracy to the centralized baseline on MNIST.

CGD
  Participants              Difference to the baseline
  m = (10 × 7)              1.0%
  m = (100 × 49)            3.3%
  m = (1000 × 112)          3.8%
Traditional FL via DP [2]
  Noise levels              Difference to the baseline¹
  ε = 8 (small noise)       1.3%
  ε = 2 (medium noise)      3.3%
  ε = 0.5 (large noise)     8.3%
¹ The centralized baseline of validation accuracy on MNIST in [2] is 98.3%.

[Figure 5: Results on the validation loss and accuracy for different numbers of participants on the CIFAR-10 dataset. Panels a)-b): m = (10 × 2); c)-d): m = (100 × 16); e)-f): m = (1000 × 32). Each pair plots validation loss and validation accuracy over 6000 training iterations for centralized training, CGD training, and local training.]

7.1.2 CIFAR-10. Our second set of experiments is conducted on the CIFAR-10 dataset. It consists of 60000 32 × 32 colour images in 10 classes (e.g., airplane, bird, and cat), with 6000 images per class. The images are divided into 50000 training images and 10000 test images. We use the ResNet-56 architecture proposed by He et al. [17]. It takes as input images of size 32 × 32, with the per-pixel mean subtracted. Its first layer is 3 × 3 convolutions, and then a stack of 3 × 18 layers is used, with 3 × 3 convolutions on feature maps of sizes {32, 16, 8} respectively, and 18 layers for each feature map size. The numbers of filters are {16, 32, 64} in each stack. The network ends with an average pooling layer, a fully-connected (FC) layer with 1024 units, and softmax. Our training data augmentation follows the setting in [17]. For each training image, we generate a new distorted image by randomly flipping the image horizontally with probability 0.5, and adding 4 pixels of padding to all sides before cropping the image back to the size of 32 × 32.
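This augmentation maps directly onto standard torchvision transforms. The sketch below (assuming PyTorch/torchvision, which the paper does not mandate) reproduces the random horizontal flip and the pad-then-crop step described above:

```python
from torchvision import transforms

# Sketch of the CIFAR-10 training augmentation described above:
# random horizontal flip with probability 0.5, then pad 4 pixels on
# every side and crop back to 32 x 32. Per-pixel mean subtraction is
# stated in the text but its constants are not given, so it is omitted here.
train_transform = transforms.Compose([
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomCrop(32, padding=4),
    transforms.ToTensor(),
])
```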
Table 5: Performance comparison on CIFAR-10.

                      Validation loss              Validation accuracy
                      CGD       Local training     CGD       Local training
Centralized           0.675                        75.72%
m = (10 × 2)          0.720     0.980              74.79%    65.3%
m = (100 × 16)        0.723     2.008              74.42%    26.25%
m = (1000 × 32)       0.724     2.434              74.29%    16.19%

Since our focus is on evaluating the proposed CGD, rather than enhancing the state-of-the-art analysis on CIFAR-10, we utilize the transferability² of convolutional layers to save the computational cost of computing per-example gradients. We follow the experiment setting of Abadi et al. [2], which treats CIFAR-10 as the private dataset and CIFAR-100 as a public dataset. CIFAR-100 has the same image types as CIFAR-10, and it has 100 classes containing 600 images each. We use CIFAR-100 to train a network with the aforementioned architecture, freeze the parameters of the convolutional layers, and retrain only the last FC layer on CIFAR-10.

Training on the entire dataset with batch update reaches a validation accuracy of 75.75%, which is taken as our centralized training baseline³. For the CGD training, each participant feeds the pre-trained convolutional layers with its private data partition, generates the input features to the FC layer, and randomly initializes the confined model parameters of the FC layer. The participants then take the input to the FC layer as their private local training data, and train a model using Algorithm 1 with δ = 0.01 and T = 6000.
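A minimal PyTorch sketch of this freeze-and-retrain setup is shown below. The backbone and optimizer are illustrative assumptions, not the authors' code: torchvision's resnet18 stands in for the CIFAR-100-pretrained ResNet-56 described above.

```python
import torch
import torch.nn as nn
from torchvision import models

# Stand-in backbone; in practice the CIFAR-100-pretrained ResNet-56 weights
# would be loaded here instead of a randomly initialized resnet18.
model = models.resnet18()

for param in model.parameters():           # freeze every pre-trained parameter
    param.requires_grad = False

# Replace the final fully-connected layer with a randomly initialized one
# for the 10 CIFAR-10 classes; only its parameters remain trainable.
model.fc = nn.Linear(model.fc.in_features, 10)

# In the centralized baseline this layer is trained with ordinary SGD;
# under CGD its gradients would instead be aggregated via Algorithm 1.
optimizer = torch.optim.SGD(model.fc.parameters(), lr=0.01)
```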
Figure 5 and Table 5 summarize our experimental results against centralized training and local training. The results on the validation loss and accuracy are generally in line with those on the MNIST dataset. In the worst case of m = (1000 × 32), CGD achieves 0.724 and 74.29% in validation loss and accuracy respectively, which are close to the centralized baseline (0.675 and 75.72%). Table 6 summarizes the results against DP-FL. In line with the results on MNIST, the accuracy difference to the centralized baseline in CGD is smaller than that in DP-FL.

² Transfer learning allows the analyst to take a model trained on one dataset and transfer it to another without retraining [41].
³ We note that by making the network deeper or using other advanced techniques, better accuracy can be obtained, with the state of the art being about 99.37% [23].
Table 6: Comparison between CGD and the traditional FL on CIFAR-10.

CGD
  Participants              Difference to the baseline
  m = (10 × 2)              0.93%
  m = (100 × 16)            1.30%
  m = (1000 × 32)           1.43%
Traditional FL via DP [2]
  Noise levels              Difference to the baseline¹
  ε = 8 (small noise)       7%
  ε = 4 (medium noise)      10%
  ε = 2 (large noise)       13%
¹ The centralized baseline of validation accuracy on CIFAR-10 in [2] is 80%.

7.2 Influencing Factors of CGD

In this section, we study the influence of the initialization and of the parameter μ on CGD's performance. This study is conducted on the MNIST dataset.

7.2.1 Role of the initialization. According to Theorem 1, the gap between the CGD solution and the centralized model is bounded by $\epsilon = m \| \mathbb{E}_{j \in m}(w_1^{l} - w_1^{j}) \|$, in which each $w_1^{l}$ is decided by δ in Equation 11. Therefore, we investigate how the initialization setting affects CGD's performance.

To this end, we conduct two experiments in which we vary the parameter δ while keeping the other settings unchanged (i.e., μ = 0 and T = 2000). In the first experiment, we let all participants use the same δ in {0.1, 0.06, 0.01, 0.001}, as these values are around $\frac{1}{\sqrt{60000}}$ (recall that 60000 is the sample size of MNIST). In the other experiment, we let each participant randomly select its own δ from the uniform distribution over the range 0.001 to 0.1.

The results of our first experiment are shown in Figure 6 and the first five columns of Table 7. In general, as δ decreases, CGD achieves better performance. This confirms our expectation: decreasing δ reduces the value of $\| \mathbb{E}_{j \in m_h}(w_1^{l} - w_1^{j}) \|$, such that the confined models are closer to the centralized optimum. We have not observed a significant difference among δ = 0.1, 0.06, and 0.01. In the case of m = (1000 × 112), the validation accuracy and loss reach 97.08% and 0.096 with δ = 0.01, which outperform the 93.74% and 0.209 obtained with δ = 0.1, but both are close to the centralized model, whose validation accuracy and loss are 97.61% and 0.078 respectively. However, δ cannot be set too small, in order to maintain numerical stability in the neural network [13]. For example, when we lower δ to 0.001, the performance starts decreasing.

[Figure 6: Results on the validation loss and accuracy for different initialization parameters δ with m = (1000 × 112) on the MNIST dataset. Panels: a) validation loss, b) validation accuracy, over 2000 training iterations, for centralized training and CGD with δ = 0.1, 0.06, and 0.01.]

Table 7: Performance with different initialization parameters δ for different numbers of participants on the MNIST dataset.

Validation loss
                      δ = 0.1    δ = 0.06   δ = 0.01   δ = 0.001   Random
Centralized           0.080      0.074      0.078      0.111
m = (10 × 7)          0.143      0.106      0.092      0.121       0.110
m = (100 × 49)        0.188      0.125      0.094      0.129       0.124
m = (1000 × 112)      0.209      0.132      0.096      0.117       0.136

Validation accuracy
                      δ = 0.1    δ = 0.06   δ = 0.01   δ = 0.001   Random
Centralized           97.54%     97.79%     97.61%     96.69%
m = (10 × 7)          95.60%     96.72%     97.21%     96.38%      96.59%
m = (100 × 49)        94.22%     96.05%     97.15%     96.19%      96.13%
m = (1000 × 112)      93.74%     95.91%     97.08%     96.59%      95.82%

The results of our second experiment are shown in Figure 7 and the last column of Table 7. When the participants uniformly randomize their δs from 0.001 to 0.1, the performance of CGD is still comparable with the centralized model. For example, with m = (1000 × 112) participants, CGD achieves 95.82% and 0.136 in validation accuracy and loss. This suggests that CGD remains robust when the δs of its participants differ by two orders of magnitude.

[Figure 7: Results on the validation loss and accuracy for various numbers of participants with δ selected uniformly at random on the MNIST dataset. Panels: a) validation loss, b) validation accuracy, over 2000 training iterations, for centralized training and CGD with m = (10 × 7), (100 × 49), and (1000 × 112).]

7.2.2 Effect of the parameter μ. According to Theorem 1, the convergence speed of CGD is affected by the parameter μ. We thus conduct an experiment to investigate this relation. In this experiment, we tune μ while keeping δ fixed at 0.1. For the stepsize α_k, we run CGD with a fixed α until it (approximately) reaches a preferred point, and then continue our experiment with α_1 = α. This keeps CGD practical, as each time the stepsize is diminished, more iterations are required. Therefore, the first 6000 iterations are run with α = 0.01, and α_k is then diminished in each of the following 8000 iterations.
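One way to realize this schedule is sketched below. The warm-up value and cut-over point come from the description above; the specific 1/t-style decay after iteration 6000 is an assumption, as the paper does not state the exact diminishing rule.

```python
def stepsize(k, alpha=0.01, warm_iters=6000):
    """Stepsize schedule sketched from the experiment description: a fixed alpha
    for the first 6000 iterations, then a diminishing stepsize. The 1/t decay
    used here is an assumption; the paper does not specify the exact rule."""
    if k <= warm_iters:
        return alpha
    return alpha / (k - warm_iters + 1)
```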
Figure 8 demonstrates the results with varying μ for m = (100 × 49) and m = (1000 × 112). For the case of m = (100 × 49), μ ≥ 0.01 gives a slightly faster speed: CGD reaches a validation accuracy of 95.9% at the 11401st iteration, 2340 iterations (16.7% fewer) faster than training with μ < 0.01. For the case of m = (1000 × 112), μ > 0.05 gives a faster speed: CGD reaches a validation accuracy of 95.68% at the 11821st iteration, 2140 iterations (15.28% fewer) faster than training with μ ≤ 0.05. Even though such a slight difference is observed, our experiment suggests that the effect of tuning μ is relatively limited: the gain in validation accuracy is only within 0.2%, at the cost of around 2000 iterations.

[Figure 8: Validation accuracy with different values of the parameter μ for different numbers of participants on the MNIST dataset. Panels: a) m = (100 × 49), comparing 0 < μ < 0.01 with 0.01 ≤ μ < 1; b) m = (1000 × 112), comparing 0 < μ ≤ 0.05 with 0.05 < μ < 1; both over training iterations 6000-14000.]

8 CONCLUSION

We have presented CGD, a novel optimization algorithm for learning confined models to enhance privacy for federated learning. Privacy preservation is achieved against an honest-but-curious adversary even though the majority (m − 2 out of m) of participants are corrupted. We formally proved the convergence of CGD optimization. In our experiments, we achieved a validation accuracy of 97.01% with (1000 × 112) participants on the MNIST dataset, and 74.4% with (1000 × 32) participants on the CIFAR-10 dataset. Both are comparable to the performance of centralized training.

A number of future work directions are of interest. In particular, we see new research opportunities in applying our techniques to asynchronous FL, allowing different participants to be at different iterations of model updates up to a bounded delay. We are also considering other types of deep networks, for example, unsupervised neural networks such as autoencoders.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
[3] Dan Bogdanov, Sven Laur, and Jan Willemson. 2008. Sharemind: A framework for fast privacy-preserving computations. In European Symposium on Research in Computer Security. Springer, 192–206.
[4] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1175–1191.
[5] Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade. Springer, 421–436.
[6] Léon Bottou, Frank E Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. SIAM Review 60, 2 (2018), 223–311.
[7] Yi-Ruei Chen, Amir Rezapour, and Wen-Guey Tzeng. 2018. Privacy-preserving ridge regression on distributed data. Information Sciences 451 (2018), 34–49.
[8] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[9] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1322–1333.
[10] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2016. Secure Linear Regression on Vertically Partitioned Datasets. IACR Cryptology ePrint Archive 2016 (2016), 892.
[11] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2017. Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing Technologies 2017, 4 (2017), 345–364.
[12] Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
[13] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
[14] Oded Goldreich, Silvio Micali, and Avi Wigderson. 2019. How to play any mental game, or a completeness theorem for protocols with honest majority. In Providing Sound Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio Micali. 307–328.
[15] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
[16] Inken Hagestedt, Yang Zhang, Mathias Humbert, Pascal Berrang, Haixu Tang, XiaoFeng Wang, and Michael Backes. 2019. MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In NDSS.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[18] Pieter Hintjens. 2013. ZeroMQ: messaging for many applications. O'Reilly Media, Inc.
[19] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems. 1223–1231.
[20] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R Ganger, Phillip B Gibbons, and Onur Mutlu. 2017. Gaia: Geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 629–647.
[21] Yaochen Hu, Di Niu, Jianming Yang, and Shengping Zhou. 2019. FDML: A collaborative machine learning framework for distributed features. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2232–2240.
[22] Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. 2019. Blockchained on-device federated learning. IEEE Communications Letters 24, 6 (2019), 1279–1283.
[23] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. 2019. Big transfer (BiT): General visual representation learning. arXiv preprint arXiv:1912.11370 (2019).
[24] Jakub Konečnỳ, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016).
[25] Jakub Konečnỳ, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[26] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[28] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/ (2010).
[29] Hsiao-Ying Lin and Wen-Guey Tzeng. 2005. An efficient solution to the millionaires' problem based on homomorphic encryption. In International Conference on Applied Cryptography and Network Security. Springer, 456–466.
[30] Jian Liu, Mika Juuti, Yao Lu, and Nadarajah Asokan. 2017. Oblivious neural network predictions via minionn transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 619–631.
[31] Tilen Marc, Miha Stopar, Jan Hartman, Manca Bizjak, and Jolanda Modic. 2019. Privacy-Enhanced Machine Learning with Functional Encryption. In European Symposium on Research in Computer Security. Springer, 3–21.
[32] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. 2019. Exploiting unintended feature leakage in collaborative learning. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 691–706.
[33] Dmytro Mishkin and Jiri Matas. 2015. All you need is a good init. arXiv preprint arXiv:1511.06422 (2015).
[34] Payman Mohassel and Yupeng Zhang. 2017. SecureML: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 19–38.
[35] Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 739–753.
[36] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P Wellman. 2018. SoK: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 399–414.
[37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
[38] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951), 400–407.
[39] Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. 2019. Braintorrent: A peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731 (2019).
[40] Sagar Sharma and Keke Chen. 2019. Confidential boosting with random linear classifiers for outsourced user-generated data. In European Symposium on Research in Computer Security. Springer, 41–65.
[41] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35, 5 (2016), 1285–1298.
[42] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM, 1310–1321.
[43] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
[44] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. 2013. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 245–248.
[45] Maha Tebaa, Saïd El Hajji, and Abdellatif El Ghazi. 2012. Homomorphic encryption applied to the cloud computing security. In Proceedings of the World Congress on Engineering, Vol. 1. 4–6.
[46] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19.
[47] Yanjun Zhang, Guangdong Bai, Xue Li, Caitlin Curtis, Chen Chen, and Ryan KL Ko. 2020. PrivColl: Practical Privacy-Preserving Collaborative Machine Learning. In European Symposium on Research in Computer Security. Springer, 399–418.
[48] Yanjun Zhang, Guangdong Bai, Mingyang Zhong, Xue Li, and Ryan Ko. 2020. Differentially private collaborative coupling learning for recommender systems. IEEE Intelligent Systems (2020).
[49] Huadi Zheng, Qingqing Ye, Haibo Hu, Chengfang Fang, and Jie Shi. 2019. BDPL: A Boundary Differentially Private Layer Against Machine Learning Model Extraction Attacks. In European Symposium on Research in Computer Security. Springer, 66–83.

Appendix A ADDITIVE SECRET SHARING SCHEME

Secret sharing schemes aim to securely distribute secret values amongst a group of participants. CGD employs the secret sharing scheme proposed by [3], which uses additive sharing over $\mathbb{Z}_{2^{32}}$. In this scheme, a secret value $srt$ is split into $s$ shares $E_{srt}^1, \ldots, E_{srt}^s \in \mathbb{Z}_{2^{32}}$ such that

$E_{srt}^1 + E_{srt}^2 + \ldots + E_{srt}^s \equiv srt \mod 2^{32}$,   (42)

and any $s-1$ elements $E_{srt}^{i_1}, \ldots, E_{srt}^{i_{s-1}}$ are uniformly distributed. This prevents any participant who holds part of the shares from deriving the value of $srt$, unless all participants join their shares.

In addition, the scheme has a homomorphic property that allows efficient and secure addition on a set of secret values $srt_1, \ldots, srt_s$ held by corresponding participants $S_1, \ldots, S_s$. To do this, each participant $S_i$ executes a randomised sharing algorithm $Shr(srt_i, S)$ to split its secret $srt_i$ into shares $E_{srt_i}^1, \ldots, E_{srt_i}^s$, and distributes each $E_{srt_i}^j$ to the participant $S_j$. Then, each $S_i$ locally adds the shares it holds, $E_{srt_1}^i, \ldots, E_{srt_s}^i$, and produces $\sum_{j=1}^{s} E_{srt_j}^i$ (denoted by $E^i$ for brevity). After that, a reconstruction algorithm $Rec(\{(E^i, S_i)\}_{S_i \in S})$, which takes $E^i$ from each participant and adds them together, can be executed by an aggregator to reconstruct $\sum_{i=1}^{s} srt_i$ without revealing any secret addend $srt_i$.
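A minimal Python sketch of this sharing scheme is given below. It is illustrative only: function names are taken from the Shr/Rec notation above, and real-valued gradients would additionally require a fixed-point encoding into $\mathbb{Z}_{2^{32}}$, which is omitted here.

```python
import secrets

MOD = 2 ** 32  # shares live in Z_{2^32}

def shr(srt, s):
    """Split a secret srt into s additive shares that sum to srt mod 2^32."""
    shares = [secrets.randbelow(MOD) for _ in range(s - 1)]
    shares.append((srt - sum(shares)) % MOD)
    return shares

def rec(partial_sums):
    """Reconstruct the sum of all secrets from each participant's local share sum E^i."""
    return sum(partial_sums) % MOD

# Example: three participants securely compute the sum of their secrets.
secrets_in = [11, 22, 33]
s = len(secrets_in)
# Each participant i splits its secret and sends the j-th share to participant j.
all_shares = [shr(x, s) for x in secrets_in]
# Participant i locally adds the i-th share of every secret, producing E^i.
partial = [sum(all_shares[j][i] for j in range(s)) % MOD for i in range(s)]
assert rec(partial) == sum(secrets_in) % MOD  # reconstructs 66 without exposing any addend
```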
Appendix B SIMULATION PARADIGM

In the simulation paradigm (a.k.a. the real/ideal model) [14], the security of a protocol is proved by comparing what an adversary can do in a real protocol execution to what it can do in an ideal scenario, which is secure by definition. Formally, a protocol P securely computes a functionality F_p if, for every adversary A in the real model, there exists an adversary S in the ideal model such that the view of the adversary from a real execution, VIEW_real, is indistinguishable from the view of the adversary from an ideal execution, VIEW_ideal. The adversary S in the ideal model is called the simulator. The indistinguishability between VIEW_real and VIEW_ideal guarantees that the adversary can learn nothing more than its own inputs and the information required by S for the simulation. In other words, the information required by S for the simulation is the only information that can leak to the adversary A from the real execution.

Let c denote the set of corrupted parties. The simulator performs the following operations:

• Generate dummy inputs {η^l} for each honest party l ∉ c and receive the actual inputs {x^l} of the corrupted parties l ∈ c;
• Run P over {η^l} (l ∉ c) and {x^l} (l ∈ c), and add all messages sent/received by corrupted parties to VIEW_ideal;
• Send the inputs of the corrupted parties {x^l} (l ∈ c) to the trusted third party;
• Receive the outputs of the corrupted parties {y^l} (l ∈ c) from the trusted third party and add them to VIEW_ideal.

Meanwhile, a real instance of P is executed with actual inputs for all parties, and VIEW_real is created by gathering the inputs of corrupted parties, the messages sent/received by corrupted parties during the protocol, and their final outputs. Once the simulation is finished, the security is proved by showing that VIEW_ideal is indistinguishable from VIEW_real.
Appendix C SUPPLEMENTARY EXPERIMENTAL RESULTS

Table 8: Supplementary experimental results on the validation loss and accuracy for different numbers of participants on the MNIST dataset in the default setting.

                      Validation loss              Validation accuracy
                      CGD       Local training     CGD       Local training
Centralized           0.081                        97.54%
m = (10 × 1)          0.143     0.278              95.62%    92.01%
m = (100 × 1)         0.188     0.549              94.22%    82.89%
m = (1000 × 1)        0.208     1.566              93.76%    50.11%

[Figure 9: Supplementary experimental results on the validation loss and accuracy for different numbers of participants on the MNIST dataset in the default setting. Panels a)-b): m = (10 × 1); c)-d): m = (100 × 1); e)-f): m = (1000 × 1). Each pair plots validation loss and validation accuracy over 2000 training iterations for centralized training, CGD training, and local training.]