Confined Gradient Descent Privacy-Preserving Optim
ABSTRACT
Federated learning enables multiple participants to collaboratively train a model without aggregating the training data. Although the training data are kept within each participant and the local gradients can be securely synthesized, recent studies have shown that such privacy protection is insufficient. The global model parameters that have to be shared for optimization are susceptible to leaking information about training data. In this work, we propose Confined Gradient Descent (CGD) that enhances privacy of federated learning by eliminating the sharing of global model parameters. CGD exploits the fact that a gradient descent optimization can start with a set of discrete points and converge to another set at the neighborhood of the global minimum of the objective function. It lets the participants independently train on their local data, and securely share the sum of local gradients to benefit each other. We formally demonstrate CGD's privacy enhancement over traditional FL. We prove that less information is exposed in CGD compared to that of traditional FL. CGD also guarantees desired model accuracy. We theoretically establish a convergence rate for CGD. We prove that the loss of the proprietary models learned for each participant against a model learned by aggregated training data is bounded. Extensive experimental results on two real-world datasets demonstrate that the performance of CGD is comparable with that of centralized learning, with marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%).

∗ The corresponding author.

1 INTRODUCTION
The performance of machine learning largely relies on the availability of large representative datasets. To take advantage of massive data owned by multiple entities, federated learning (FL) is proposed [24, 25, 46]. It enables participants to jointly train a global model without the necessity of sharing their datasets, demonstrating the potential to address the issues of data privacy and data ownership. It has been incorporated by popular machine learning tools such as TensorFlow [1] and PyTorch [37], and has increasingly spread over various industries.

The privacy preservation of FL stems from its parallelization of the gradient descent optimization, which in essence is an application of stochastic gradient descent (SGD) (or the mini-batch mode) [25]. During the training process, the participants work on the same intermediate global model via a coordinating server (in the centralized FL) [2, 4, 34] or a peer-to-peer communication scheme (in the decentralized FL) [22, 39]. Each of them obtains the current model parameters, works out a local gradient based on the local data, and disseminates it to update the global model synchronously [2, 4] or asynchronously [21]. This paradigm guarantees data locality, but has been found insufficient for data privacy: although the local gradients can be securely synthesized via a variety of techniques such as differential privacy (DP) [2, 21, 42, 48], secure multi-party computation (MPC) [4, 11, 34], and homomorphic encryption (HE) [31, 34, 40], the global model parameters that have to be shared are still susceptible to information leakage (cf. Section 5 and [35, 36]).

This work further decreases the dependency among participants by eliminating the explicit sharing of the central global model, which is the root cause of the information leakage [9, 35]. We propose a new optimization algorithm named Confined Gradient Descent (CGD) that enables each participant to learn a proprietary global model. The CGD participants maintain their global models locally, which are strictly confined within themselves from the beginning of and throughout the whole training process. We refer to these localized global models as confined models, to distinguish them from the global model in traditional FL.

CGD is inspired by an observation on the surface of the typical cost function. The steepness of the first derivative decreases slower when approaching the minimum of the function, due to the small values in the Hessian (i.e., the second derivative) near the optimum [5]. This gives the function, when plotted, a flat valley bottom. As such, a gradient descent algorithm $\mathcal{A}$, when applied on an objective function $F$, could start with a set of discrete points (referred to as a colony; their distance is discussed later). Iteratively descending the colony using the joint gradient of the colony would lead $\mathcal{A}$ to the neighborhood of $F$'s minimum in the "flat valley bottom". The points in the colony would also end up with similar losses that are close to the loss of the minimum.

In Figure 1, we illustrate a holistic comparison between the workflow of CGD and that of gradient descent in traditional FL. In traditional FL, every participant updates the same global model $w$ using their local gradients $g_a, g_b, g_c$. In CGD, each participant $l$ first independently initializes the starting point of its confined model
$w_1^l$. Then, in every training iteration, participants independently compute the local gradient from their current confined model and local data, and then jointly work out the sum of all local gradients and use it to update their confined models (the equation in Figure 1b). By doing this, CGD aims to enhance privacy without sacrificing much model accuracy. For the sake of simplicity, we refer to these two properties as privacy and accuracy.
• Privacy. CGD should ensure that, throughout the training process, neither local data of a participant nor intermediate results computed on them can be observed by other participants or an aggregator (if any).
• Accuracy. The prediction made by any confined model should approach that of the centralized model, i.e., the model that would be learned centrally on the gathered data.

The desired privacy enhancement of CGD stems from two aspects, i.e., secrecy of confined models and secrecy of local gradients. For the former, besides always hiding the confined models from each other, each participant independently initializes its $w_1^l$ at random. During the training process, any two confined models keep the same distance and never become closer to each other after descending, preventing any participant from predicting models of others. To further boost the unpredictability, each participant could select its own interval range of initial weights to avoid leaking the average distance between the confined models. CGD withstands interval ranges differing by two orders of magnitude. For the latter, CGD incorporates the secure addition operation on the local gradients to calculate their sum. This has been proved to be viable through the additive secret sharing scheme, in which the sum of a set of secret values is collaboratively calculated without revealing any addends [3, 4, 29, 45]. A previous study [47] demonstrates it is efficient when applied to achieve decentralization in FL. We prove that the adversary's observation in CGD is only the sum of local gradients, and it conceals the extra indicative information that traditional FL would leak (cf. Section 5).

We formally prove that CGD ensures convergence. It converges to confined models that are adjacent to the centralized model, and the adjacency is bounded (cf. Section 4). This merit guarantees the accuracy of CGD. We further evaluate the accuracy performance of CGD with two popular benchmark datasets, i.e., MNIST [28] and CIFAR-10 [26]. Our experiments demonstrate that its accuracy closely approaches that of the centralized learning. When the confined models are initialized with the standard initialization scheme (i.e., the Gaussian distribution of mean 0 and variance 1), it achieves marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%) and outperforms state-of-the-art federated learning with differential privacy. Its accuracy performance remains stable even when the interval ranges of initial weights among participants differ by two orders of magnitude.

Contributions. We summarize the main contributions as follows.
• Confined Gradient Descent For Privacy-enhancing Decentralized Federated Learning. We propose a new optimization algorithm CGD for privacy-preserving decentralized FL. CGD eliminates the explicit sharing of the global model and lets each participant learn a proprietary confined model. CGD retains the merits of traditional FL such as algorithm independence. Therefore, it can easily accommodate any FL schemes regardless of their underlying machine/deep learning algorithms. It also eliminates the necessity of a central coordinating server, such that the optimization can be conducted in a fully decentralized manner.
• Convergence Analysis. We theoretically establish a convergence rate for CGD under realistic assumptions on the loss function (such as convexity). We prove CGD converges toward the centralized model as the number of iterations increases. The distance between the trained confined models and the centralized model is bounded, and can be tuned by the hyper-parameter setting.
• Enhanced Privacy Preservation Over Traditional FL. With secrecy of both confined model and local gradients, CGD achieves enhanced privacy preservation over traditional FL. We prove that in CGD, given only the sum of the local gradients, an honest-but-curious white-box adversary, who may control $t$ out of $m$ participants (where $t \le m - 2$) including the aggregator for secure addition operation (if any), can learn no information other than their own inputs and the sum of the local gradients from other honest parties, whereas in traditional FL, extra indicative information about local data can be obtained.
• Functional Evaluations. We implement CGD and conduct experiments on two popular benchmark datasets, MNIST and CIFAR-10. The results demonstrate that CGD can closely approach the performance of the centralized model on the validation loss and accuracy, with marginal differences on validation loss (mostly within 0.05) and accuracy (mostly within 1%).

2 BACKGROUND AND RELATED WORKS
CGD is an optimization method based on gradient descent. Therefore, in this section, we review the existing techniques for gradient updates in traditional FL.

2.1 Stochastic gradient descent
Stochastic gradient descent (SGD) [6, 38] is an efficient variant of the gradient descent algorithm. It is extensively used for optimizing the objective function in machine learning and deep learning. Given a cost function $F$ with the parameter $w$, SGD is defined by

$$w_{k+1} \leftarrow w_k - \alpha_k \frac{1}{|\xi_k|} \nabla F(w_k, \xi_k), \tag{1}$$

where $w_k$ is the parameter at the $k$-th iteration, $\xi_k \subseteq \xi$ is a randomly selected subset of the training samples at the $k$-th iteration, and $\alpha_k$ is the learning rate. Equation 1 can generalize to the mini-batch update when $1 < |\xi_k| < |\xi|$, and to the batch update when $\xi_k = \xi$.

In FL, each local participant $l \in \mathcal{L}$ holds a subset of the training samples, denoted by $\xi_l$. To run SGD (or the mini-batch update), for each iteration, a random subset $\xi_{l,k} \subseteq \xi_l$ from a random participant $l$ is selected. The participant $l$ then computes the gradient with respect to $\xi_{l,k}$, which can be written as $\nabla F(w_k, \xi_{l,k})$, and shares the gradient with other participants (or a parameter server). All the participants (or the server) can thus take a gradient descent step by

$$w_{k+1} \leftarrow w_k - \alpha_k \frac{1}{|\xi_{l,k}|} \nabla F(w_k, \xi_{l,k}). \tag{2}$$
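As an illustration of the update rule in Equations 1 and 2, the following is a minimal sketch in plain Python. It assumes a least-squares loss on a toy noiseless dataset; the function names (`minibatch_gradient`, `sgd_step`, `run_sgd`), the data, and the hyper-parameters are hypothetical choices made for this example and do not come from the paper.

```python
import random

def minibatch_gradient(w, batch):
    """Summed gradient of 0.5 * (w.x - y)^2 over the samples in `batch`."""
    grad = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad

def sgd_step(w, batch, alpha):
    """One update w <- w - alpha * (1/|batch|) * gradient (shape of Equation 1)."""
    grad = minibatch_gradient(w, batch)
    return [wi - alpha * gi / len(batch) for wi, gi in zip(w, grad)]

def run_sgd(data, dim, alpha=0.1, batch_size=4, iters=500, seed=0):
    """Mini-batch SGD: each iteration draws a random subset of the data."""
    rng = random.Random(seed)
    w = [0.0] * dim
    for _ in range(iters):
        batch = rng.sample(data, batch_size)
        w = sgd_step(w, batch, alpha)
    return w

# Toy data (assumption): y = 2*x0 - 1*x1 without noise.
_rng = random.Random(1)
DATA = []
for _ in range(32):
    x = [_rng.uniform(-1, 1), _rng.uniform(-1, 1)]
    DATA.append((x, 2.0 * x[0] - 1.0 * x[1]))

W_FIT = run_sgd(DATA, dim=2)
```

Setting `batch_size` to `len(data)` would correspond to the batch update, and `batch_size=1` to the single-sample case.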
[Figure 1 diagram omitted. Panel a: participants holding local data $\xi_a, \xi_b, \xi_c$ compute $g_a = g(w, \xi_a)$, $g_b = g(w, \xi_b)$, $g_c = g(w, \xi_c)$ on the shared model, with $w_* = w_0 - \mathit{iter}\, f(g(w, \xi_a), g(w, \xi_b), g(w, \xi_c))$. Panel b: participants compute $g_a = g(w^a, \xi_a)$, $g_b = g(w^b, \xi_b)$, $g_c = g(w^c, \xi_c)$ on their own confined models, with $w_*^a = w_0^a - \mathit{iter}(\Sigma(g(w^a, \xi_a), g(w^b, \xi_b), g(w^c, \xi_c)))$. The legend distinguishes shared information in FL from protected information in FL.]

a) Gradient descent in traditional federated learning. Participants jointly work on the same global model $w$ using the descent computed by $f$. Although the local gradients $g_a, g_b, g_c$ can be securely synthesized, by knowing $w$ and $f$, the adversary is able to derive information about the local raw data $\xi_a, \xi_b, \xi_c$.

b) Confined Gradient Descent. Each participant strictly confines their own global models from their initialization ($w_0^a, w_0^b, w_0^c$) to optimal values ($w_*^a, w_*^b, w_*^c$). The confined models descend in the same pace, and when CGD converges, reach the bottom of the valley where the centralized model is located. Any two confined models keep the same distance throughout the training process. In other words, $w^a, w^b, w^c$ would not become closer to each other during descending, preventing any participant from predicting models of others.

Figure 1: Comparison of CGD and the gradient descent in traditional FL. This figure does not differentiate each iteration: the occurrences of traditional FL's global model $w_k$ and confined models $w_k^a, w_k^b, w_k^c$ in all iterations are represented by $w, w^a, w^b, w^c$, and $\mathit{iter}$ represents the sum-up of all iterations.
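The confined update of Figure 1b can be sketched in a few lines of plain Python. This is a toy illustration under stated assumptions, not the authors' implementation: the loss is least squares, the data are synthetic, the secure addition is replaced by a plain sum (`secure_sum_stub`), and all names and hyper-parameters are hypothetical. Because every participant subtracts the same summed gradient, the offsets between any two confined models stay fixed throughout training.

```python
import random

def local_gradient(w, data):
    """Gradient of 0.5 * sum (w.x - y)^2 over one participant's private data."""
    grad = [0.0] * len(w)
    for x, y in data:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        for i, xi in enumerate(x):
            grad[i] += err * xi
    return grad

def secure_sum_stub(grads):
    """Plain element-wise sum; a stand-in (assumption) for secure addition."""
    return [sum(g[i] for g in grads) for i in range(len(grads[0]))]

def cgd(partitions, init_models, alpha, iters):
    """Each participant keeps its own model; all apply the same summed gradient."""
    models = [list(w) for w in init_models]
    for _ in range(iters):
        grads = [local_gradient(w, d) for w, d in zip(models, partitions)]
        total = secure_sum_stub(grads)
        models = [[wi - alpha * ti for wi, ti in zip(w, total)] for w in models]
    return models

_rng = random.Random(0)

def make_partition(n):
    """Toy private partition (assumption): y = 2*x0 - 1*x1 without noise."""
    out = []
    for _ in range(n):
        x = [_rng.uniform(-1, 1), _rng.uniform(-1, 1)]
        out.append((x, 2.0 * x[0] - 1.0 * x[1]))
    return out

PARTS = [make_partition(10) for _ in range(3)]
# Independent random initialization with a small spread (0.01 is an assumption).
INIT = [[_rng.gauss(0.0, 0.01) for _ in range(2)] for _ in range(3)]
FINAL = cgd(PARTS, INIT, alpha=0.02, iters=300)
```

In this toy setting each entry of `FINAL` ends up near the least-squares solution, offset by an amount that depends on how far apart the initial models were.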
The gradients, if shared in plain text, are subject to information leakage of the local training data. For example, model-inversion attacks [9, 32, 43] are able to restore training data from the gradients. In the immediately following sections, we summarize the existing privacy-preserving methods for synthesizing the local gradients, which fall into two broad categories, i.e., secure aggregation and learning with differential privacy.

2.2 Secure aggregation
Secure aggregation typically employs cryptographic mechanisms such as homomorphic encryption (HE) [7, 31, 40] and/or secure multiparty computation (MPC) [4, 10, 11, 30, 34, 47] to securely evaluate the gradient $\nabla F(w_k, \xi_{l,k})$ without revealing local data. With $\nabla F(w_k, \xi_{l,k})$, all the participants can thus take a gradient descent step by Equation 2.

Some existing studies fall into this category. For instance, Bonawitz et al. [4] present a secure aggregation protocol that allows a server to compute the sum of user-held data vectors, which can be used to aggregate user-provided model updates for a deep neural network. Mohassel et al. [34] propose a secure two-party computation (2PC) protocol that supports secure arithmetic operations on shared decimal numbers for calculating the gradient updates using SGD. Existing FL frameworks employing HE/MPC are mainly based on federated SGD. All the participants in it share and update one and the same global model, and this is subject to membership inference, as revealed by Nasr et al. [35].

2.3 Learning with differential privacy
Another line of studies approaches privacy-preserving FL through the differential privacy (DP) mechanism [2, 8, 12, 16, 21, 42, 44, 48, 49]. The common practice of achieving differential privacy is based on additive noise calibrated to $\nabla F$'s sensitivity $S_{\nabla F}^2$. As such, a differentially private learning framework can be achieved by updating parameters with perturbed gradients at each iteration, for example, to update parameters as

$$w_{k+1} \leftarrow w_k - \alpha_k \frac{1}{|\xi_{l,k}|} \left( \nabla F(w_k, \xi_{l,k}) + \mathcal{N}(0, S_{\nabla F}^2 \cdot \sigma^2) \right), \tag{3}$$

where $\mathcal{N}(0, S_{\nabla F}^2 \cdot \sigma^2)$ is the Gaussian distribution (a commonly used noise distribution in differentially private learning frameworks [8]) with mean 0 and standard deviation $S_{\nabla F} \cdot \sigma$.

The privacy loss is accumulated with repeated access to the data during training epochs [2]. There is also an inherent tradeoff between privacy and utility of the trained model.

In summary, in all of the above approaches, the global model has to be shared with each participant, leading to the leakage of information. This motivates CGD's design to eliminate the explicit sharing of the central global model.
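The perturbed update of Equation 3 can be sketched as follows. This is a hedged illustration in plain Python, not a vetted DP implementation: it assumes a least-squares loss, and it bounds the sensitivity by clipping each per-example gradient to norm `sens`, which is the approach of DP-SGD in [2] rather than something stated in this section. The names `clip` and `dp_sgd_step` are hypothetical, and no privacy accounting is performed.

```python
import math
import random

def clip(vec, max_norm):
    """Scale `vec` down so that its L2 norm is at most `max_norm`."""
    norm = math.sqrt(sum(v * v for v in vec))
    scale = min(1.0, max_norm / norm) if norm > 0 else 1.0
    return [v * scale for v in vec]

def dp_sgd_step(w, batch, alpha, sens, sigma, rng):
    """One noisy step: clipped gradient sum plus Gaussian noise with std sens*sigma."""
    total = [0.0] * len(w)
    for x, y in batch:
        err = sum(wi * xi for wi, xi in zip(w, x)) - y
        g = clip([err * xi for xi in x], sens)  # per-example clipping (assumption)
        total = [t + gi for t, gi in zip(total, g)]
    # Noise term N(0, sens^2 * sigma^2) added to each coordinate, as in Equation 3.
    noisy = [t + rng.gauss(0.0, sens * sigma) for t in total]
    return [wi - alpha * ni / len(batch) for wi, ni in zip(w, noisy)]

# Example call with hypothetical values; sigma=0 would disable the noise.
W_NEXT = dp_sgd_step([0.0, 0.0], [([1.0, 0.5], 1.0)], alpha=0.1,
                     sens=1.0, sigma=1.0, rng=random.Random(0))
```

Larger `sigma` gives stronger perturbation and, in general, lower accuracy, which is the privacy/utility tradeoff mentioned above.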
3 CONFINED GRADIENT DESCENT
CGD optimizes an objective function in FL with multiple local datasets. It starts with a colony of discrete points, and then uses the combination of their gradients to lead the optimization to another colony of points at the neighborhood of the global optimum. In this section, we formalize this problem and present the workflow of CGD optimization.

[Figure 2 diagram omitted. a) Centralized Training: a centralized model is trained on training data collected centrally. b) Traditional Federated Learning: participants pull parameters of the global model from the parameter server and replace the corresponding local parameters, then push gradients via privacy-preserving techniques such as the MPC/HE/DP mechanism. A third panel shows participants that each hold a confined global model and combine their local gradients through secure addition ($\Sigma$), with the steps labelled ① to ④.]

3.1 Problem formulation
3.1.1 Optimization objective. Consider a centralized dataset $\xi = \{(x_i, y_i)\}_{i=1}^{n}$ consisting of $n$ training samples. The goal of machine learning is to find a model parameter $w$ such that the overall loss, which is measured by the distance between the model prediction

where $F(w; \xi)$ is the loss function, and $z(w)$ is the regularizer for $w$. We use $w^*$ to denote the optimal solution of centralized training (i.e., the centralized model).

In the context of FL, we have a system of $m$ local participants, each of which holds a private dataset $\xi_l \subseteq \xi$ ($l \in [1, m]$) consisting of a part of the training dataset. The part could be a part of training samples, a part of features that have common entries, or both. Assume the training takes $T$ iterations, and let $w_k^l$ denote the confined model of participant $l$ at the $k$-th iteration, where $k \in [1, T]$. Let
• Initialization. Each participant $l$ randomizes its own $w_1^l$. A default setting is to sample based on the Gaussian distribution of mean 0 and variance 1, which is the standard weight initialization scheme used in most machine learning approaches [13, 27, 33]. To prevent the colluding participants from inferring others' points by the knowledge of the average distance among the confined models, we introduce a hyper-parameter $\delta^l$ to control the interval range of initial weights $w_1^l$, i.e., $w_1^l \sim N[-\delta^l, \delta^l]$, and allow each participant $l$ to independently choose its own $\delta^l$.
• Step 1. At each iteration $k$, every participant computes the local gradient $g^l(w_k^l, \xi_l)$ with respect to its current confined model $w_k^l$ and own dataset $\xi_l$ using Equation 5.
• Step 2. Securely compute $\sum_{l=1}^{m} g^l(w_k^l, \xi_l)$, which is later used for calculating $w_{k+1}^l$ (double lines in Figure 2c). This step presumes that computational tools exist for securely evaluating the sum of a set of secret values to avoid releasing the local gradient $g^l(w_k^l, \xi_l)$ in plain text. We refer to Section 5 for more detail.
• Step 3. A scalar stepsize $\alpha_k > 0$ is chosen given an iteration number $k \in [1, T]$.
• Step 4. Every participant takes a descent step on its own $w_k^l$ with $\alpha_k \sum_{l=1}^{m} g^l(w_k^l, \xi_l)$, i.e., $w_{k+1}^l \leftarrow w_k^l - \alpha_k \sum_{l=1}^{m} g^l(w_k^l, \xi_l)$.

Algorithm 1 Confined Gradient Descent Optimization
Input: Local training data $\xi_l$ ($l \in [1, m]$), number of training iterations $T$
Output: Confined global model parameters $w_*^l$ ($l \in [1, m]$)
Initialize: $k \leftarrow 1$, each participant $l$ randomizes its own $w_1^l$
while $k \le T$ do
    for all participants $l \in [1, m]$ do in parallel
        Compute the local gradient: $g^l(w_k^l, \xi_l)$
        Securely evaluate the sum: $\sum_{l=1}^{m} g^l(w_k^l, \xi_l)$
        Choose a stepsize: $\alpha_k$
        Set the new iterate as: $w_{k+1}^l \leftarrow w_k^l - \alpha_k \sum_{l=1}^{m} g^l(w_k^l, \xi_l)$
    end for
    $k \leftarrow k + 1$
end while

model, defined as

$$R = \frac{1}{T} \sum_{k=1}^{T} \left( F(w_k^l) - F(w^*) \right) \tag{6}$$

4.1 Assumptions
We make the following assumptions on the loss function $F$. They are all common assumptions in convergence analyses of most gradient-based methods, and are satisfied by a variety of widely used cost functions [6], such as mean squared error (MSE) and cross entropy.
[Table comparing, for each technique, (i) the observation of an honest-but-curious white-box adversary during training iterations, (ii) the exposed indicative information about $(x_l, y_l)$ in the example of linear regression, and (iii) the boosted privacy of CGD over traditional FL.]

Technique: Plain federated SGD
  Adversary's observation: Local gradients $g_k^l(w_k, \xi_l)$, which equals $x_l^T \Delta(x_l w_k, y_l)$.
  Exposed indicative information: $\{x_l^T x_l, x_l^T y_l\}$
  Boosted privacy of CGD: In plain federated SGD, indicative information about the local training dataset can be observed.

Technique: Secure aggregated federated SGD
  Adversary's observation: Sum of the local gradients from the remaining honest participants, $\sum g_k^l(w_k, \xi_l)\,|_{l \in \mathcal{D} \setminus c}$, and with a shared $w_k$, it equals $x_{\mathcal{D} \setminus c}^T \Delta(x_{\mathcal{D} \setminus c} w_k, y_{\mathcal{D} \setminus c})$ (Theorem 3).
  Exposed indicative information: $\{x_{\mathcal{D} \setminus c}^T x_{\mathcal{D} \setminus c}, x_{\mathcal{D} \setminus c}^T y_{\mathcal{D} \setminus c}\}$
  Boosted privacy of CGD: With the shared global $w_k$, indicative information about the concatenated training datasets from honest participants can be observed.

Technique: Differentially private federated SGD (additive noise based mechanism)
  Adversary's observation: Perturbed local gradients $g_k^l(w_k, \xi_l) + \mathcal{N}_k$, which equals $x_l^T \Delta(x_l w_k, y_l) + \mathcal{N}_k$.
  Exposed indicative information: Not applicable with the single access to $(x_l, y_l)$. However, with repeated access to $(x_l, y_l)$ during epochs, the adversary is able to get a more accurate estimate of $x_l^T \Delta(x_l w_k, y_l)$, which can be used to obtain $\{x_l^T x_l, x_l^T y_l\}$.
  Boosted privacy of CGD: The decay of privacy with increasing number of training epochs is one of the limitations of most additive-noise based differentially private learning.

Technique: CGD
  Adversary's observation: Sum of the local gradients from the remaining honest participants, $\sum g_k^l(w_k^l, \xi_l)\,|_{l \in \mathcal{L} \setminus c}$ (Theorem 2).
  Exposed indicative information: Not applicable.
  Boosted privacy of CGD: In CGD, each local gradient is computed from a different and private $w_k^l$, and only the sum of local gradients is exposed during the training. As such, (1) the indicative information exposed in the secure aggregation cannot be derived in CGD; (2) the randomness introduced in each $w_k^l$ hides the information about the local gradient throughout the training, and thus privacy does not decay.
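The CGD row above relies on exposing only the sum of the local gradients. One way to picture such a secure addition is additive secret sharing, sketched below in plain Python. This is a simplified, hypothetical illustration: the modulus `PRIME`, the fixed-point factor `SCALE`, and the function names are assumptions made for the example, all parties are simulated in one process, and `random.Random` is used only for reproducibility (a real deployment would need a cryptographically secure source such as the `secrets` module, plus authenticated channels).

```python
import random

PRIME = 2**61 - 1   # modulus for the shares (assumption)
SCALE = 10**6       # fixed-point scaling for real-valued inputs (assumption)

def encode(x):
    """Map a real number to a residue mod PRIME using fixed-point scaling."""
    return int(round(x * SCALE)) % PRIME

def decode(v):
    """Inverse of encode; residues above PRIME/2 represent negative numbers."""
    if v > PRIME // 2:
        v -= PRIME
    return v / SCALE

def share(secret, n, rng):
    """Split `secret` into n random-looking shares that sum to it mod PRIME."""
    shares = [rng.randrange(PRIME) for _ in range(n - 1)]
    shares.append((secret - sum(shares)) % PRIME)
    return shares

def secure_sum(values, rng):
    """Sum private values so that only the total is reconstructed."""
    n = len(values)
    # Row l holds the shares participant l would send to participants 0..n-1.
    table = [share(encode(v), n, rng) for v in values]
    # Participant j adds the shares it received and publishes only that partial sum.
    partial = [sum(table[l][j] for l in range(n)) % PRIME for j in range(n)]
    return decode(sum(partial) % PRIME)

TOTAL = secure_sum([0.5, -1.25, 2.0], random.Random(0))
```

Each individual share is uniformly distributed mod `PRIME`, so a single share or partial sum on its own does not determine any addend; for a gradient vector the same routine would be applied coordinate by coordinate.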
Input: … functions of $N$ layers: $\sigma^{(1)} \ldots \sigma^{(N)}$, cost function $J$, the number of training iterations $T$
Output: Confined global model parameters of $N$ layers: $w_*^{(1)\,l_{(i,j)}}, \ldots, w_*^{(N)\,l_{(i,j)}}$ ($i \in m^h$, $j \in m^v$)
Initialize: $k \leftarrow 1$, each participant $l_{(i,j)}$ randomizes its own $w_1^{(1)\,l_{(i,j)}}, \ldots, w_1^{(N)\,l_{(i,j)}}$
while $k \le T$ do
    Forward propagation
    for all participants $l_{(i,j)}$ do in parallel
        $z_k^{(1)\,l_{(i,j)}} \leftarrow \xi_{l_{(i,j)}}\, w_k^{(1)\,l_{(i,j)}}$
        $a_k^{(1)\,l_{(i,j)}} \leftarrow \sigma^{(1)}(z_k^{(1)\,l_{(i,j)}})$
        $z_k^{(2)\,l_{(i,j)}} \leftarrow a_k^{(1)\,l_{(i,j)}}\, w_k^{(2)\,l_{(i,j)}}$
        ......
        $a_k^{(N)\,l_{(i,j)}} \leftarrow \sigma^{(N)}(z_k^{(N)\,l_{(i,j)}})$, $\hat{y}^{\,l_{(i,j)}} \leftarrow a_k^{(N)\,l_{(i,j)}}$
    end for
    Backward propagation
    for all participants $l_{(i,j)}$ do in parallel
        $\delta_k^{(N)\,l_{(i,j)}} \leftarrow \dfrac{\partial J(\hat{y}^{\,l_{(i,j)}}, y^{\,l_{(i,j)}})}{\partial a_k^{(N)\,l_{(i,j)}}} \dfrac{\partial a_k^{(N)\,l_{(i,j)}}}{\partial z_k^{(N)\,l_{(i,j)}}}$
        $g_k^{(N)\,l_{(i,j)}} \leftarrow \delta_k^{(N)\,l_{(i,j)}} \dfrac{\partial z_k^{(N)\,l_{(i,j)}}}{\partial w_k^{(N)\,l_{(i,j)}}}$
        for $r = N - 1, \ldots, 1$ do
            $\delta_k^{(r)\,l_{(i,j)}} \leftarrow \delta_k^{(r+1)\,l_{(i,j)}} \dfrac{\partial z_k^{(r+1)\,l_{(i,j)}}}{\partial a_k^{(r)\,l_{(i,j)}}} \dfrac{\partial a_k^{(r)\,l_{(i,j)}}}{\partial z_k^{(r)\,l_{(i,j)}}}$
            $g_k^{(r)\,l_{(i,j)}} \leftarrow \delta_k^{(r)\,l_{(i,j)}} \dfrac{\partial z_k^{(r)\,l_{(i,j)}}}{\partial w_k^{(r)\,l_{(i,j)}}}$
        end for
    end for
    Securely evaluate the sum of local gradients
    for all participants $l_{(i,j)}$ do in parallel
        $g_k^{(r)} \leftarrow \sum_{i,j=1}^{m^h, m^v} g_k^{(r)\,l_{(i,j)}}$, ($r \in [2, N]$)
    end for
    for all groups $l_j = \{l_{(1,j)}, \ldots, l_{(m^h,j)}\}$ do in parallel
        Evaluate the sum of first layer: $g_k^{(1)\,l_j} \leftarrow \sum_{i=1}^{m^h} g_k^{(1)\,l_{(i,j)}}$
    end for
    Descent
    Choose a stepsize: $\alpha_k$
    for all groups $l_j = \{l_{(1,j)}, \ldots, l_{(m^h,j)}\}$ do in parallel
        $w_{k+1}^{(1)\,l_{(i,j)}} \leftarrow w_k^{(1)\,l_{(i,j)}} - \alpha_k\, g_k^{(1)\,l_j}$
    end for
    for all participants $l_{(i,j)}$ do in parallel
        $w_{k+1}^{(r)\,l_{(i,j)}} \leftarrow w_k^{(r)\,l_{(i,j)}} - \alpha_k\, g_k^{(r)}$, ($r \in [2, N]$)
    end for
    $k \leftarrow k + 1$
end while

[Figure 4 plots omitted. Panels a, c, e show validation loss and panels b, d, f show validation accuracy against training iterations (0 to 2000) for $m = (10 \times 7)$, $m = (100 \times 49)$ and $m = (1000 \times 112)$ respectively; each panel compares centralized training, CGD training and local training.]

Figure 4: Results on the validation loss and accuracy for different number of participants on the MNIST dataset.

• the standard weight initialization scheme based on the Gaussian distribution of mean 0 and variance 1.

Figure 4 summarizes the performance with varying number of participants. In general, CGD stays close to the centralized baseline in both validation loss and accuracy, and as expected, significantly outperforms the local training. In the worst case of $m = (1000 \times 112)$, when CGD contains the greatest number of participants, i.e., each participant owns a small proportion consisting of 60 samples with 7 features, CGD still achieves 0.209 and 93.74% in validation loss and accuracy respectively, close to 0.081 and 97.54% of the centralized baseline. In contrast, the performance of local training declines to 2.348 and 11.64%. Table 3 lists the detailed performance comparison with both centralized and local trainings, and Table 4 lists the comparison with DP-FL. More results are given with varying number of participants in Appendix C (Figure 9 and Table 8).

7.1.2 CIFAR-10. Our second set of experiments is conducted on the CIFAR-10 dataset. It consists of 60000 32 × 32 colour images in 10 classes (e.g., airplane, bird, and cat), with 6000 images per class. The images are divided into 50000 training images and 10000 test images. We use the ResNet-56 architecture proposed by He et al. [17]. It takes as input images of size 32 × 32, with the per-pixel mean subtracted. Its first layer is 3 × 3 convolutions, and then
a stack of 3 × 18 layers is used, with 3 × 3 convolutions of filter sizes {32, 16, 8} respectively, and 18 layers for each filter size. The numbers of filters are {16, 32, 64} in each stack. The network ends with an average pooling layer, a fully-connected (FC) layer with 1024 units, and softmax. Our training data augmentation follows the setting in [17]. For each training image, we generate a new distorted image by randomly flipping the image horizontally with probability 0.5, and adding 4 pixels of padding to all sides before cropping the image to the size of 32 × 32.

Since our focus is on evaluating the proposed CGD, rather than enhancing the state-of-the-art analysis on CIFAR-10, we utilize the transferability² of convolutional layers to save the computational cost of computing per-example gradients. We follow the experiment setting of Abadi et al. [2], which treats CIFAR-10 as the private dataset and CIFAR-100 as a public dataset. CIFAR-100 has the same image types as CIFAR-10, and it has 100 classes containing 600 images each. We use CIFAR-100 to train a network with the aforementioned architecture, then freeze the parameters of the convolutional layers and retrain only the last FC layer on CIFAR-10.

Training on the entire dataset with batch update reaches the validation accuracy of 75.75%, which is taken as our centralized training baseline³. For the CGD training, each participant feeds the pre-trained convolutional layers with their private data partitions, generates the input features to the FC layer, and randomly initializes the confined model parameters of the FC layer. They then take the input to the FC layer as the private local training data, and train a model using Algorithm 1 with $\delta = 0.01$ and $T = 6000$.

² Transfer learning allows the analyst to take a model trained on one dataset and transfer it to another without retraining [41].
³ We note that by making the network deeper or using other advanced techniques, better accuracy can be obtained, with the state-of-the-art being about 99.37% [23].

Table 3: Performance comparison on MNIST with default settings.

                     Validation loss            Validation accuracy
                     CGD      Local training    CGD       Local training
  Centralized        0.081                      97.54%
  m=(10 × 7)         0.143    1.283             96.5%     54.39%
  m=(100 × 49)       0.189    2.076             94.21%    23.53%
  m=(1000 × 112)     0.209    2.348             93.74%    11.64%

Table 4: Comparison between CGD and traditional FL via differential privacy in terms of the difference of validation accuracy to the centralized baseline on MNIST.

  CGD
  Participants             Difference to the baseline
  m=(10 × 7)               1.0%
  m=(100 × 49)             3.3%
  m=(1000 × 112)           3.8%

  Traditional FL via DP [2]
  Noise levels             Difference to the baseline¹
  ε = 8 (small noise)      1.3%
  ε = 2 (medium noise)     3.3%
  ε = 0.5 (large noise)    8.3%

  ¹ The centralized baseline of validation accuracy on MNIST in [2] is 98.3%.

[Figure 5 plots omitted. Panels a, c, e show validation loss and panels b, d, f show validation accuracy against training iterations (0 to 6000) for $m = (10 \times 2)$, $m = (100 \times 16)$ and $m = (1000 \times 32)$ respectively; each panel compares centralized training, CGD training and local training.]

Figure 5: Results on the validation loss and accuracy for different number of participants on the CIFAR-10 dataset.

Table 5: Performance comparison on CIFAR-10.

                     Validation loss            Validation accuracy
                     CGD      Local training    CGD       Local training
  Centralized        0.675                      75.72%
  m=(10 × 2)         0.720    0.980             74.79%    65.3%
  m=(100 × 16)       0.723    2.008             74.42%    26.25%
  m=(1000 × 32)      0.724    2.434             74.29%    16.19%

Figure 5 and Table 5 summarize our experimental results against centralized training and local training. The results on the validation loss and accuracy are generally in line with those on the MNIST dataset. In the worst case of $m = (1000 \times 32)$, CGD achieves 0.724 and 74.29% in validation loss and accuracy respectively, which are relatively near to the centralized baseline (0.675 and 75.72%). Table 6 summarizes the results against DP-FL. In line with the results on MNIST, the accuracy difference to the centralized baseline in CGD is smaller than that in DP-FL.
Table 6: Comparison between CGD and the traditional FL on
CIFAR-10.
Centralized training
0.8 delta=0.1 0.95
CGD delta=0.06
Validation accuracy
delta=0.01
Validation loss
Participants Difference to the baseline 0.6 0.90
m=(10 × 2) 0.93%
0.4 0.85
m=(100 × 16) 1.30%
Centralized training
m=(1000 × 32) 1.43% 0.80 delta=0.1
0.2 delta=0.06
Traditional FL via DP [2] delta=0.01
Difference to the baseline1 0 500 1000 1500 2000 0 500 1000 1500 2000
Noise levels Training iterations Training iterations
𝜖 = 8 (small noise) 7% a) Validation loss b) Validation accuracy
𝜖 = 4 (medium noise) 10%
𝜖 = 2 (large noise) 13%
1 Figure 6: Results on the validation loss and accuracy for dif-
The centralized baseline of validation accuracy on the CIFAR-10 in [2] is 80%.
ferent initialization parameter 𝛿 with 𝑚 = (1000 × 112) on the
MNIST dataset.
Figure 7: Results on the validation loss and accuracy for various number of participants with 𝛿 selected uniformly at random on the MNIST dataset. [Panels: a) Validation loss, b) Validation accuracy, each plotted against training iterations (0 to 2000) for centralized training and m=(10*7), m=(100*49) and m=(1000*112).]

Table 7: Performance with different initialization parameter 𝛿 for different number of participants on the MNIST dataset.

Validation loss
                  𝛿 = 0.1   𝛿 = 0.06   𝛿 = 0.01   𝛿 = 0.001   Random
Centralized       0.080     0.074      0.078      0.111
m=(10 × 7)        0.143     0.106      0.092      0.121       0.110
m=(100 × 49)      0.188     0.125      0.094      0.129       0.124
m=(1000 × 112)    0.209     0.132      0.096      0.117       0.136

Validation accuracy
                  𝛿 = 0.1   𝛿 = 0.06   𝛿 = 0.01   𝛿 = 0.001   Random
Centralized       97.54%    97.79%     97.61%     96.69%
m=(10 × 7)        95.60%    96.72%     97.21%     96.38%      96.59%
m=(100 × 49)      94.22%    96.05%     97.15%     96.19%      96.13%
m=(1000 × 112)    93.74%    95.91%     97.08%     96.59%      95.82%

7.2.1 Role of the initialization. According to Theorem 1, the gap between the CGD solution and the centralized model is bounded by $\epsilon = m \| \mathbb{E}_{j \in m} (w_1^l - w_1^j) \|$, in which each $w_1^l$ is decided by 𝛿 in Equation 11. Therefore, we investigate how the initialization setting affects CGD's performance.
To this end, we conduct two experiments in which we vary the parameter 𝛿 while keeping the other settings unchanged (i.e., 𝜇 = 0 and 𝑇 = 2000). In the first experiment, we let all participants use the same 𝛿 in {0.1, 0.06, 0.01, 0.001}, as they are around $1/\sqrt{60000}$ (recall that 60000 is the sample size of MNIST). In the other experiment, we let each participant randomly select its own 𝛿 from the uniform distribution over the range from 0.001 to 0.1.
The results of our first experiment are shown in Figure 6 and the first five columns in Table 7. In general, as 𝛿 decreases, CGD achieves better performance. This confirms our expectation: decreasing 𝛿 would reduce the value of $\| \mathbb{E}_{j \in m^h} (w_1^l - w_1^j) \|$, such that the confined models are closer to the centralized optimum. We have not observed a significant difference among 𝛿 = 0.1, 0.06 and 0.01. In the case of 𝑚 = (1000 × 112), the validation accuracy and loss reach 97.08% and 0.096 with 𝛿 = 0.01, which outperforms the 93.74% and 0.209 obtained with 𝛿 = 0.1, but both are close to the centralized model, whose validation accuracy and loss are 97.61% and 0.078 respectively. However, 𝛿 cannot be set too small, in order to maintain numerical stability in the neural network [13]. For example, when we lower 𝛿 to 0.001, the performance starts decreasing.
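As a rough illustration of the two 𝛿 settings described in this subsection, the following Python sketch computes the reference scale $1/\sqrt{60000}$ and draws one 𝛿 per participant from the uniform range 0.001 to 0.1. The function names, the `num_participants` argument and the `seed` argument are hypothetical and chosen only for this example; how 𝛿 enters the model initialization (Equation 11) is not reproduced here.

```python
import math
import random

MNIST_TRAIN_SIZE = 60000                  # sample size quoted in the text
FIXED_DELTAS = [0.1, 0.06, 0.01, 0.001]   # values used in the first experiment
DELTA_RANGE = (0.001, 0.1)                # range used in the second experiment


def reference_scale(n_samples: int) -> float:
    """Return 1/sqrt(n), the scale the fixed deltas are compared against."""
    return 1.0 / math.sqrt(n_samples)


def sample_deltas(num_participants: int, seed: int = 0) -> list[float]:
    """Draw one delta per participant uniformly from DELTA_RANGE.

    Both parameters are illustrative assumptions, not values from the paper.
    """
    rng = random.Random(seed)
    low, high = DELTA_RANGE
    return [rng.uniform(low, high) for _ in range(num_participants)]


if __name__ == "__main__":
    scale = reference_scale(MNIST_TRAIN_SIZE)
    print(f"1/sqrt({MNIST_TRAIN_SIZE}) = {scale:.5f}")
    for delta in FIXED_DELTAS:
        print(f"delta = {delta}: {delta / scale:.2f} x reference scale")
    deltas = sample_deltas(num_participants=10)
    print(f"random deltas span {min(deltas):.4f} to {max(deltas):.4f}")
```

The sampled range covers two orders of magnitude, which is the spread the second experiment is meant to probe.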
The results of our second experiment are shown in Figure 7 and the last column in Table 7. When the participants uniformly randomize their 𝛿s from 0.001 to 0.1, the performance of CGD is still comparable with the centralized mode. For example, with 𝑚 = (1000 × 112) participants, CGD achieves 95.82% and 0.136 in the validation accuracy and loss. This suggests that CGD remains robust when the 𝛿s of its participants differ by two orders of magnitude.

7.2.2 Effect of the parameter 𝜇. According to Theorem 1, the convergence speed of CGD is affected by the parameter 𝜇. We thus conduct an experiment to investigate this relation. In this experiment, we tune 𝜇 while keeping 𝛿 fixed as 0.1. For the stepsize ($\alpha_k$), we run CGD with a fixed 𝛼 till it (approximately) reaches a preferred point, and then continue our experiment with $\alpha_1 = \alpha$. This is to keep CGD practical, as each time the stepsize is diminished, more iterations are required. Therefore, the first 6000 iterations are run with 𝛼 = 0.01, and $\alpha_k$ is then diminished in each of the following 8000 iterations.
Figure 8 demonstrates the results with varying 𝜇 for 𝑚 = (100 × 49) and 𝑚 = (1000 × 112). For the case of 𝑚 = (100 × 49), 𝜇 ≥ 0.01
gives a slightly faster speed. CGD reaches the validation accuracy of 95.9% at the 11401st iteration, 2340 iterations (16.7% less) faster than training with 𝜇 < 0.01. For the case of 𝑚 = (1000 × 112), 𝜇 > 0.05 gives a faster speed. CGD reaches the validation accuracy of 95.68% at the 11821st iteration, 2140 iterations (15.28% less) faster than training with 𝜇 ≤ 0.05. Even though such a slight difference is observed, our experiment suggests that the effect of tuning 𝜇 is relatively limited. The gain in validation accuracy is only within 0.2%, at the cost of around 2000 iterations.

Figure 8: Validation accuracy with different parameter 𝜇 for different number of participants on the MNIST dataset. [Panels: a) m=(100 × 49), with curves 0<mu<0.01 and 0.01<=mu<1; b) m=(1000 × 112), with curves 0<mu<=0.05 and 0.05<mu<1. Both plot validation accuracy against training iterations from iteration 6000 onwards.]

8 CONCLUSION
We have presented CGD, a novel optimization algorithm for learning confined models to enhance privacy for federated learning. Privacy preservation is achieved against an honest-but-curious adversary even when the majority (𝑚 − 2 out of 𝑚) of participants are corrupted. We formally proved the convergence of CGD optimization. In our experiments, we achieved validation accuracy of 97.01% with (1000 × 112) participants on the MNIST dataset, and 74.4% with (1000 × 32) participants on the CIFAR-10 dataset. Both are comparable to the performance of centralized training.
A number of future work directions are of interest. In particular, we see new research opportunities in applying our techniques to asynchronous FL, allowing different participants to be at different iterations of model updates up to a bounded delay. We are also considering other types of deep networks, for example, unsupervised neural networks such as autoencoders.

REFERENCES
[1] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, et al. 2016. Tensorflow: Large-scale machine learning on heterogeneous distributed systems. arXiv preprint arXiv:1603.04467 (2016).
[2] Martin Abadi, Andy Chu, Ian Goodfellow, H Brendan McMahan, Ilya Mironov, Kunal Talwar, and Li Zhang. 2016. Deep learning with differential privacy. In Proceedings of the 2016 ACM SIGSAC Conference on Computer and Communications Security. ACM, 308–318.
[3] Dan Bogdanov, Sven Laur, and Jan Willemson. 2008. Sharemind: A framework for fast privacy-preserving computations. In European Symposium on Research in Computer Security. Springer, 192–206.
[4] Keith Bonawitz, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. 2017. Practical secure aggregation for privacy-preserving machine learning. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 1175–1191.
[5] Léon Bottou. 2012. Stochastic gradient descent tricks. In Neural networks: Tricks of the trade. Springer, 421–436.
[6] Léon Bottou, Frank E Curtis, and Jorge Nocedal. 2018. Optimization methods for large-scale machine learning. Siam Review 60, 2 (2018), 223–311.
[7] Yi-Ruei Chen, Amir Rezapour, and Wen-Guey Tzeng. 2018. Privacy-preserving ridge regression on distributed data. Information Sciences 451 (2018), 34–49.
[8] Cynthia Dwork, Aaron Roth, et al. 2014. The algorithmic foundations of differential privacy. Foundations and Trends® in Theoretical Computer Science 9, 3–4 (2014), 211–407.
[9] Matt Fredrikson, Somesh Jha, and Thomas Ristenpart. 2015. Model inversion attacks that exploit confidence information and basic countermeasures. In Proceedings of the 22nd ACM SIGSAC Conference on Computer and Communications Security. 1322–1333.
[10] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2016. Secure Linear Regression on Vertically Partitioned Datasets. IACR Cryptology ePrint Archive 2016 (2016), 892.
[11] Adrià Gascón, Phillipp Schoppmann, Borja Balle, Mariana Raykova, Jack Doerner, Samee Zahur, and David Evans. 2017. Privacy-preserving distributed linear regression on high-dimensional data. Proceedings on Privacy Enhancing Technologies 2017, 4 (2017), 345–364.
[12] Robin C Geyer, Tassilo Klein, and Moin Nabi. 2017. Differentially private federated learning: A client level perspective. arXiv preprint arXiv:1712.07557 (2017).
[13] Xavier Glorot and Yoshua Bengio. 2010. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics. 249–256.
[14] Oded Goldreich, Silvio Micali, and Avi Wigderson. 2019. How to play any mental game, or a completeness theorem for protocols with honest majority. In Providing Sound Foundations for Cryptography: On the Work of Shafi Goldwasser and Silvio Micali. 307–328.
[15] Gaël Guennebaud, Benoît Jacob, et al. 2010. Eigen v3. http://eigen.tuxfamily.org.
[16] Inken Hagestedt, Yang Zhang, Mathias Humbert, Pascal Berrang, Haixu Tang, XiaoFeng Wang, and Michael Backes. 2019. MBeacon: Privacy-Preserving Beacons for DNA Methylation Data. In NDSS.
[17] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. 2016. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition. 770–778.
[18] Pieter Hintjens. 2013. ZeroMQ: messaging for many applications. O’Reilly Media, Inc.
[19] Qirong Ho, James Cipar, Henggang Cui, Seunghak Lee, Jin Kyu Kim, Phillip B Gibbons, Garth A Gibson, Greg Ganger, and Eric P Xing. 2013. More effective distributed ml via a stale synchronous parallel parameter server. In Advances in neural information processing systems. 1223–1231.
[20] Kevin Hsieh, Aaron Harlap, Nandita Vijaykumar, Dimitris Konomis, Gregory R Ganger, Phillip B Gibbons, and Onur Mutlu. 2017. Gaia: Geo-distributed machine learning approaching LAN speeds. In 14th USENIX Symposium on Networked Systems Design and Implementation (NSDI 17). 629–647.
[21] Yaochen Hu, Di Niu, Jianming Yang, and Shengping Zhou. 2019. FDML: A collaborative machine learning framework for distributed features. In Proceedings of the 25th ACM SIGKDD International Conference on Knowledge Discovery & Data Mining. 2232–2240.
[22] Hyesung Kim, Jihong Park, Mehdi Bennis, and Seong-Lyun Kim. 2019. Blockchained on-device federated learning. IEEE Communications Letters 24, 6 (2019), 1279–1283.
[23] Alexander Kolesnikov, Lucas Beyer, Xiaohua Zhai, Joan Puigcerver, Jessica Yung, Sylvain Gelly, and Neil Houlsby. 2019. Big transfer (bit): General visual representation learning. arXiv preprint arXiv:1912.11370 (2019).
[24] Jakub Konečný, H Brendan McMahan, Daniel Ramage, and Peter Richtárik. 2016. Federated optimization: Distributed machine learning for on-device intelligence. arXiv preprint arXiv:1610.02527 (2016).
[25] Jakub Konečný, H Brendan McMahan, Felix X Yu, Peter Richtárik, Ananda Theertha Suresh, and Dave Bacon. 2016. Federated learning: Strategies for improving communication efficiency. arXiv preprint arXiv:1610.05492 (2016).
[26] Alex Krizhevsky, Geoffrey Hinton, et al. 2009. Learning multiple layers of features from tiny images. (2009).
[27] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. 2017. Imagenet classification with deep convolutional neural networks. Commun. ACM 60, 6 (2017), 84–90.
[28] Yann LeCun and Corinna Cortes. 2010. MNIST handwritten digit database. http://yann.lecun.com/exdb/mnist/. (2010).
[29] Hsiao-Ying Lin and Wen-Guey Tzeng. 2005. An efficient solution to the millionaires’ problem based on homomorphic encryption. In International Conference on Applied Cryptography and Network Security. Springer, 456–466.
[30] Jian Liu, Mika Juuti, Yao Lu, and Nadarajah Asokan. 2017. Oblivious neural network predictions via minionn transformations. In Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security. ACM, 619–631.
[31] Tilen Marc, Miha Stopar, Jan Hartman, Manca Bizjak, and Jolanda Modic. 2019. Privacy-Enhanced Machine Learning with Functional Encryption. In European Symposium on Research in Computer Security. Springer, 3–21.
[32] Luca Melis, Congzheng Song, Emiliano De Cristofaro, and Vitaly Shmatikov. 2019. Exploiting unintended feature leakage in collaborative learning. In 2019
IEEE Symposium on Security and Privacy (SP). IEEE, 691–706.
[33] Dmytro Mishkin and Jiri Matas. 2015. All you need is a good init. arXiv preprint arXiv:1511.06422 (2015).
[34] Payman Mohassel and Yupeng Zhang. 2017. Secureml: A system for scalable privacy-preserving machine learning. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 19–38.
[35] Milad Nasr, Reza Shokri, and Amir Houmansadr. 2019. Comprehensive privacy analysis of deep learning: Passive and active white-box inference attacks against centralized and federated learning. In 2019 IEEE Symposium on Security and Privacy (SP). IEEE, 739–753.
[36] Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P Wellman. 2018. Sok: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P). IEEE, 399–414.
[37] Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. 2019. Pytorch: An imperative style, high-performance deep learning library. arXiv preprint arXiv:1912.01703 (2019).
[38] Herbert Robbins and Sutton Monro. 1951. A stochastic approximation method. The annals of mathematical statistics (1951), 400–407.
[39] Abhijit Guha Roy, Shayan Siddiqui, Sebastian Pölsterl, Nassir Navab, and Christian Wachinger. 2019. Braintorrent: A peer-to-peer environment for decentralized federated learning. arXiv preprint arXiv:1905.06731 (2019).
[40] Sagar Sharma and Keke Chen. 2019. Confidential boosting with random linear classifiers for outsourced user-generated data. In European Symposium on Research in Computer Security. Springer, 41–65.
[41] Hoo-Chang Shin, Holger R Roth, Mingchen Gao, Le Lu, Ziyue Xu, Isabella Nogues, Jianhua Yao, Daniel Mollura, and Ronald M Summers. 2016. Deep convolutional neural networks for computer-aided detection: CNN architectures, dataset characteristics and transfer learning. IEEE transactions on medical imaging 35, 5 (2016), 1285–1298.
[42] Reza Shokri and Vitaly Shmatikov. 2015. Privacy-preserving deep learning. In Proceedings of the 22nd ACM SIGSAC conference on computer and communications security. ACM, 1310–1321.
[43] Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. 2017. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP). IEEE, 3–18.
[44] Shuang Song, Kamalika Chaudhuri, and Anand D Sarwate. 2013. Stochastic gradient descent with differentially private updates. In 2013 IEEE Global Conference on Signal and Information Processing. IEEE, 245–248.
[45] Maha Tebaa, Saïd El Hajji, and Abdellatif El Ghazi. 2012. Homomorphic encryption applied to the cloud computing security. In Proceedings of the World Congress on Engineering, Vol. 1. 4–6.
[46] Qiang Yang, Yang Liu, Tianjian Chen, and Yongxin Tong. 2019. Federated machine learning: Concept and applications. ACM Transactions on Intelligent Systems and Technology (TIST) 10, 2 (2019), 1–19.
[47] Yanjun Zhang, Guangdong Bai, Xue Li, Caitlin Curtis, Chen Chen, and Ryan KL Ko. 2020. PrivColl: Practical Privacy-Preserving Collaborative Machine Learning. In European Symposium on Research in Computer Security. Springer, 399–418.
[48] Yanjun Zhang, Guangdong Bai, Mingyang Zhong, Xue Li, and Ryan Ko. 2020. Differentially private collaborative coupling learning for recommender systems. IEEE Intelligent Systems (2020).
[49] Huadi Zheng, Qingqing Ye, Haibo Hu, Chengfang Fang, and Jie Shi. 2019. BDPL: A Boundary Differentially Private Layer Against Machine Learning Model Extraction Attacks. In European Symposium on Research in Computer Security. Springer, 66–83.

Appendix A ADDITIVE SECRET SHARING SCHEME
Secret sharing schemes aim to securely distribute secret values amongst a group of participants. CGD employs the secret sharing scheme proposed by [3], which uses additive sharing over $\mathbb{Z}_{2^{32}}$. In this scheme, a secret value $srt$ is split into $s$ shares $E_{srt}^{1}, \dots, E_{srt}^{s} \in \mathbb{Z}_{2^{32}}$ such that

$$E_{srt}^{1} + E_{srt}^{2} + \dots + E_{srt}^{s} \equiv srt \mod 2^{32}, \qquad (42)$$

and any $s - 1$ elements $E_{srt}^{i_1}, \dots, E_{srt}^{i_{s-1}}$ are uniformly distributed. This prevents any participant who has part of the shares from deriving the value of $srt$, unless all participants join their shares.
In addition, the scheme has a homomorphic property that allows efficient and secure addition on a set of secret values $srt_1, \dots, srt_s$ held by corresponding participants $S_1, \dots, S_s$. To do this, each participant $S_i$ executes a randomised sharing algorithm $Shr(srt_i, S)$ to split its secret $srt_i$ into shares $E_{srt_i}^{1}, \dots, E_{srt_i}^{s}$, and distributes each $E_{srt_i}^{j}$ to the participant $S_j$. Then, each $S_i$ locally adds the shares it holds, $E_{srt_1}^{i}, \dots, E_{srt_s}^{i}$, and produces $\sum_{j=1}^{s} E_{srt_j}^{i}$ (denoted by $E^{i}$ for brevity). After that, a reconstruction algorithm $Rec(\{(E^{i}, S_i)\}_{S_i \in S})$, which takes $E^{i}$ from each participant and adds them together, can be executed by an aggregator to reconstruct $\sum_{i=1}^{s} srt_i$ without revealing any secret addend $srt_i$.

Appendix B SIMULATION PARADIGM
In the simulation paradigm (a.k.a. the real/ideal model) [14], the security of a protocol is proved by comparing what an adversary can do in a real protocol execution to what it can do in an ideal scenario, which is secure by definition. Formally, a protocol P securely computes a functionality F𝑝 if, for every adversary A in the real model, there exists an adversary S in the ideal model such that the view of the adversary from a real execution, VIEWreal, is indistinguishable from the view of the adversary from an ideal execution, VIEWideal. The adversary S in the ideal model is called the simulator. Indistinguishability between VIEWreal and VIEWideal guarantees that the adversary can learn nothing more than their own inputs and the information required by S for the simulation. In other words, the information required by S for the simulation is the only information that can leak to adversary A from the real execution.
Let 𝑐 denote the set of corrupted parties. The simulator performs the following operations:
• Generate dummy inputs $\{\eta_l\}$ for each honest party $l \notin c$ and receive the actual inputs $\{x_l\}$ of corrupted parties $l \in c$;
• Run P over $\{\eta_l\}$ ($l \notin c$) and $\{x_l\}$ ($l \in c$) and add all messages sent/received by corrupted parties to VIEWideal;
• Send the inputs of corrupted parties $\{x_l\}$ ($l \in c$) to the trusted third party;
• Receive the outputs of corrupted users $\{y_l\}$ ($l \in c$) from the trusted third party and add them to VIEWideal.
Meanwhile, a real instance of P is executed with actual inputs for all parties, and VIEWreal is created by gathering inputs of corrupted parties, messages sent/received by corrupted parties during the protocol and their final outputs. Once the simulation is finished, the security is proved by showing that VIEWideal is indistinguishable from VIEWreal.
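To make the additive sharing scheme of Appendix A concrete, here is a minimal Python sketch of sharing, reconstruction and the homomorphic sum over Z_{2^32}. It is an illustration under stated assumptions and not the implementation of [3] or of CGD: the function names `shr`, `rec` and `secure_sum` merely mirror Shr and Rec from the text, randomness comes from Python's standard `secrets` module, and all participants are simulated inside one process with no network communication.

```python
import secrets

MODULUS = 2 ** 32  # shares live in Z_{2^32}, as in Appendix A


def shr(secret: int, num_parties: int) -> list[int]:
    """Split `secret` into `num_parties` additive shares modulo 2^32."""
    if num_parties < 1:
        raise ValueError("need at least one party")
    # The first s-1 shares are drawn uniformly; the last one fixes the sum.
    shares = [secrets.randbelow(MODULUS) for _ in range(num_parties - 1)]
    shares.append((secret - sum(shares)) % MODULUS)
    return shares


def rec(partials: list[int]) -> int:
    """Add the given values together modulo 2^32."""
    return sum(partials) % MODULUS


def secure_sum(secret_values: list[int]) -> int:
    """Simulate the homomorphic addition among s participants in one process."""
    s = len(secret_values)
    # distributed[i][j] is the share of participant i's secret sent to participant j.
    distributed = [shr(value, s) for value in secret_values]
    # Each participant j locally adds the shares it received (E^j in the text).
    local_sums = [
        sum(distributed[i][j] for i in range(s)) % MODULUS for j in range(s)
    ]
    # The aggregator only combines the local sums, never an individual secret.
    return rec(local_sums)


if __name__ == "__main__":
    print(secure_sum([5, 17, 20]))
```

Because the arithmetic is over integers modulo 2^32, real-valued quantities such as gradients would first need an integer (for example fixed-point) encoding, which this sketch leaves out.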
Appendix C SUPPLEMENTARY EXPERIMENTAL RESULTS

Table 8: Supplementary experimental results on the validation loss and accuracy for different number of participants on the MNIST dataset in the default setting.

                  Validation loss            Validation accuracy
                  CGD      Local training    CGD       Local training
m=(10 × 1)        0.143    0.278             95.62%    92.01%
m=(100 × 1)       0.188    0.549             94.22%    82.89%
m=(1000 × 1)      0.208    1.566             93.76%    50.11%

[Figure: validation loss and validation accuracy against training iterations (0 to 2000) for centralized training, CGD training and local training. Panels: a), b) 𝑚 = (10 × 1); c), d) 𝑚 = (100 × 1); e), f) 𝑚 = (1000 × 1).]