Back Propagation Chen Zhong 2009
Abstract— With the development of distributed computing environments, many learning problems now have to deal with distributed input data. To enhance cooperation in learning, it is important to address the privacy concern of each data holder by extending the privacy preservation notion to original learning algorithms. In this paper, we focus on preserving the privacy in an important learning model, multi-layer neural networks. We present a privacy preserving two-party distributed algorithm of back-propagation which allows a neural network to be trained without requiring either party to reveal her data to the other. We provide complete correctness and security analysis of our algorithms. The effectiveness of our algorithms is verified by experiments on various real world datasets.

Index Terms— Privacy, Learning, Neural Network, Back-Propagation

I. INTRODUCTION

With the development of distributed computing environments, many learning problems now have distributed input data. In such distributed scenarios, privacy concerns often become a big issue. For example, if medical researchers want to apply machine learning to study health care problems, they need to collect the raw data from hospitals and the follow-up information from patients. Then the privacy of the patients must be protected, according to the privacy rules in the Health Insurance Portability and Accountability Act (HIPAA), which establishes the regulations for the use and disclosure of Protected Health Information [1].

A natural question is why the researchers would want to build a learning model (e.g., neural networks) without first collecting all the training data on one computer. If there is a learner trusted by all the data holders, then the trusted learner can collect data first and build a learning model. However, in many real-world cases it is rather difficult to find such a trusted learner, since some data holders will always have concerns like "What will you do to my data?" and "Will you discover private information beyond the scope of research?". On the other hand, given the distributed and networked computing environments nowadays, collaborations will greatly benefit the scientific advances. The researchers have the interest to obtain the result of cooperative learning even before they see the data from other parties. As a concrete example, [2] stated that the progress in neuroscience could be boosted by making links between data from labs around the world, but some researchers are reluctant to release their data to be exploited by others because of privacy and security concerns. More specifically, neuroscientists at Dartmouth College found it difficult to encourage the sharing of brain-imaging data because there was a possibility that the raw data could be misused or misinterpreted [3]. Therefore, there is a strong motivation for learners to develop cooperative learning procedures with privacy preservation.

In this paper, we focus on one of the most popular techniques in machine learning, multi-layer neural networks [4], [5], in which the privacy preservation problem is far from being practically solved. In [6] a preliminary approach is proposed to enable privacy preservation for gradient descent methods in general. However, in terms of multi-layer neural networks, their protocol is limited as it is only for one simple neural network configuration with one node in the output layer and no hidden layers. Although their protocol is elegant in its generality, it may be very restricted in practice for privacy preserving multi-layer neural networks.

We propose a light-weight two-party distributed algorithm for privacy preserving back-propagation training with vertically partitioned data¹ (i.e., when each party has a subset of features). Our contributions can be summarized as follows. (1) Our paper is the first to investigate the problem of training multi-layer neural networks over vertically partitioned databases with privacy constraints. (2) Our algorithms are provably private in the semi-honest model [7] and light-weight in terms of computational efficiency. (3) Our algorithms include a solution to a challenging technical problem, namely privacy preserving computation of the activation function. This problem is highly challenging because most of the frequently used activation functions are infinite and continuous while cryptographic tools are defined in finite fields. To overcome this difficulty, we propose the first cryptographic method to securely compute the sigmoid function, in which an existing piecewise linear approximation of the sigmoid function [8] is used. In order to make our algorithms practical, we do not adopt the costly circuit evaluation based approaches [9]. Instead, we use a homomorphic encryption based approach, and the cryptographic scheme we choose is ElGamal [10]. (4) Both analytical and experimental results show that our algorithms are light-weight in terms of computation and communication overheads.

The rest of the paper is organized as follows.

T. Chen and S. Zhong are with the Computer Science and Engineering Department, The State University of New York at Buffalo, Buffalo, NY 14260, U.S.A. (email: {tchen9, szhong}@cse.buffalo.edu)

¹ For the horizontally partitioned scenario (i.e., when each party holds a subset of data objects with the same feature set), there is a much simpler solution: one party trains the neural network first and passes the training result to another party so that she can further train it with her data, and so on. So in this paper we only focus on the vertical partition case, which is much more technically challenging.
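The distinction in the footnote can be made concrete for the vertical case: each party holds a disjoint subset of the features of every sample, so any linear term Σ_k w_k x_k splits into two locally computable partial sums. The sketch below uses purely illustrative values (the vector sizes, weights, and split point are not from the paper).

```python
# Vertical partitioning: parties hold disjoint feature subsets of one sample.
# All values here are illustrative.
x = [0.2, 0.9, 0.4, 0.7, 0.1]      # full input vector (never materialized)
w = [0.5, -0.3, 0.8, 0.1, -0.6]    # weights, known to both parties
m_A = 3                             # party A holds features 1..m_A
x_A, x_B = x[:m_A], x[m_A:]         # A's and B's private feature shares
w_A, w_B = w[:m_A], w[m_A:]

# Each party computes a partial weighted sum over its own features only;
# the two partial sums add up to the full linear term sum_k w_k x_k.
partial_A = sum(wk * xk for wk, xk in zip(w_A, x_A))
partial_B = sum(wk * xk for wk, xk in zip(w_B, x_B))
full = sum(wk * xk for wk, xk in zip(w, x))
assert abs((partial_A + partial_B) - full) < 1e-12
```

This decomposition is what lets each party carry out its part of the feed-forward computation locally in the algorithm presented later in the paper.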
In Section II, we introduce the technical preliminaries including notations, definitions and the problem statement. In Section III, we present the novel privacy preserving back-propagation learning algorithm and two key component secure algorithms. Then we provide security analysis of the algorithm as well as the computation and communication overhead. In Section VI, with comprehensive experiments on various datasets, we verify the effectiveness and efficiency of our algorithm. After that, we conclude our paper.

A. Related Work

The notion of privacy preserving data mining was proposed by two different papers ([11] and [12]) in the year 2000. Both of the two papers addressed the problem of performing data analysis on distributed data sources with privacy constraints. In [11], Agrawal et al. presented a solution by adding noise to the source data, while in [12] Lindell and Pinkas used cryptographic tools to efficiently and securely build a decision tree classifier. After these two papers, a good number of data mining tasks have been studied with the consideration of privacy protection, for example classification [13], clustering [14], [15], association rule mining [16], etc.

In particular, privacy preserving solutions have been proposed for the following classification algorithms (to name a few): decision trees [12], [17], [18], Naive Bayes classifier [19], [20], and SVM [21], [22], [23]. Generally speaking, the existing works have taken either randomization based approaches (e.g., [11], [21]) or cryptography based approaches (e.g., [17], [24], [19], [23]). Randomization based approaches, by perturbing data, only guarantee a limited degree of privacy. In contrast, cryptography based approaches provide a better guarantee on privacy than randomization based approaches, but most of the cryptography based approaches are difficult to apply to very large databases, because they are resource demanding. For example, although Laur et al. proposed an elegant solution for privacy preserving SVM in [23], their protocols are based on circuit evaluation, which is considered very costly in practice.

In cryptography, there is also a general-purpose technique called secure multi-party computation. The works of secure multi-party computation originate from the solution to the millionaire problem proposed by Yao [25], in which two millionaires can find out who is richer without revealing the amount of their wealth. In [9], a protocol is presented which can privately compute any probabilistic polynomial function. Although secure multi-party computation can theoretically solve all problems of privacy-preserving computation, it is too expensive to be applied to practical problems [26]. This general solution is especially infeasible for cases in which parties hold huge amounts of data.

There are few works on the problem of privacy preserving neural network learning (limited to [27], [28], [6]). The most recent one is [6]. As discussed above, the difference between our work and theirs is that we focus on privacy preserving neural networks and provide a light-weight algorithm applicable to more complex neural network configurations, while their protocol is for gradient descent methods in general and thus loses some power for neural networks in particular.

Protocols in [27], [28] are also designed for privacy-preserving neural-network-based computations. In particular, Chang and Lu [27] proposed a cryptographic protocol for non-linear classification with feed-forward neural networks. Barni et al. in [28] presented three algorithms with different levels of privacy protection, i.e., protecting private weight vectors, protecting private activation functions, and preventing data providers from injecting fake cells into the system.

The fundamental difference between our work and the protocols in [27] and [28] is that they work in different learning scenarios. In [27] and [28], it is assumed that there is a neural network owner; this neural network owner owns a neural network, but does not have any data to train it. In addition, there are some data providers who have data that can be used to train the neural network. The goal of [27] and [28] is to ensure that the neural network owner does not get any knowledge about the data, and at the same time, the data providers do not get the knowledge embedded in the neural network (e.g., the node activation functions). In contrast, in this paper we consider a totally different scenario in which there are at least two neural network owners, each having his own set of data. The two parties want to jointly build one neural network based on all the data, but each party does not want to reveal his own data. As a result, the protocols in [27] and [28] cannot be applied in the scenario that we are studying in this paper.

Moreover, [27] and [28] are of theoretical interest only; they have not implemented their protocols, neither have they tested their protocols in any experiments. In contrast, we have implemented our algorithm and carried out extensive experiments. The results of our experiments show that our algorithm is practical.

II. TECHNICAL PRELIMINARIES

In this section we give a brief review of the version of the Back-Propagation Network (BPN) algorithm we consider [29] and introduce the piecewise linear approximation we use for the activation function. We also give a formal statement of the problem with a rigorous definition of security. Then we briefly explain the main cryptographic tool we use, ElGamal [10].

A. Notations for back-propagation learning

For ease of presentation, in this paper we consider a neural network of three layers, where the hidden layer activation function is sigmoid and the output layer is linear. Note that it is trivial to extend our work to more layers.

Given a neural network with a-b-c configuration, one input vector is denoted as (x_1, x_2, ..., x_a). The values of hidden layer nodes are denoted as {h_1, h_2, ..., h_b}, and the values of output layer nodes are {o_1, o_2, ..., o_c}. w^h_{jk} denotes the weight connecting the input layer node k and the hidden layer node j. w^o_{ij} denotes the weight connecting j and the output layer node i, where 1 ≤ k ≤ a, 1 ≤ j ≤ b, 1 ≤ i ≤ c.

We use Mean Square Error (MSE) as the error function in the back-propagation algorithm, e = (1/2) Σ_i (t_i − o_i)^2. For the neural networks described above, the partial derivatives are listed as (1) and (2), for future reference.
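As a concrete reference for the notation above, the sketch below implements one round of plain (non-privacy-preserving) back-propagation for an a-b-c network with sigmoid hidden units, a linear output layer, and the MSE error e = (1/2) Σ_i (t_i − o_i)^2. The variable names w_h and w_o mirror w^h_{jk} and w^o_{ij}; the 2-5-1 configuration and sample values at the bottom are arbitrary illustrations, not data from the paper.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_step(x, t, w_h, w_o, eta):
    """One plain back-propagation round; returns the MSE before the update."""
    a, b, c = len(x), len(w_h), len(w_o)
    # Feed forward: h_j = f(sum_k w^h_jk x_k), o_i = sum_j w^o_ij h_j
    h = [sigmoid(sum(w_h[j][k] * x[k] for k in range(a))) for j in range(b)]
    o = [sum(w_o[i][j] * h[j] for j in range(b)) for i in range(c)]
    # Gradients, computed from the pre-update weights:
    #   dw^o_ij = -(t_i - o_i) h_j
    #   delta_j = sum_i [-(t_i - o_i) w^o_ij] h_j (1 - h_j)
    grad_o = [[-(t[i] - o[i]) * h[j] for j in range(b)] for i in range(c)]
    delta = [sum(-(t[i] - o[i]) * w_o[i][j] for i in range(c)) * h[j] * (1.0 - h[j])
             for j in range(b)]
    for i in range(c):
        for j in range(b):
            w_o[i][j] -= eta * grad_o[i][j]
    for j in range(b):
        for k in range(a):
            w_h[j][k] -= eta * delta[j] * x[k]
    return 0.5 * sum((t[i] - o[i]) ** 2 for i in range(c))

# Tiny 2-5-1 demo on a single hypothetical sample
random.seed(0)
w_h = [[random.uniform(-0.1, 0.1) for _ in range(2)] for _ in range(5)]
w_o = [[random.uniform(-0.1, 0.1) for _ in range(5)] for _ in range(1)]
errors = [train_step([0.5, 0.8], [0.7], w_h, w_o, 0.5) for _ in range(50)]
```

Repeated calls drive the error toward zero on this single sample; the distributed protocol presented in Section III computes the same delta quantities, but on random shares.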
A value x ∈ Z_p is randomly chosen as the private key. The corresponding public key is (G, q, g, h), where h = g^x.

• Encryption. A message m ∈ G is encrypted as follows. A value r ∈ Z_p is chosen at random. Then the ciphertext is constructed as (C_1, C_2) = (g^r, m · h^r).

• Decryption. The plaintext is computed as

C_2 / C_1^x = (m · h^r) / g^{x·r} = (m · h^r) / h^r = m.

Homomorphic Property. The ElGamal scheme is homomorphic in that for two messages, m_1 and m_2, an encryption of m_1 + m_2 can be obtained by an operation on E(m_1, r) and E(m_2, r) without decrypting either of the two encrypted messages.

Probabilistic Property. The ElGamal scheme is also probabilistic, which means that besides clear texts, the encryption operation also needs a random number as input. Let an encryption of message m using public key (G, q, g, h) and a random number r be denoted as E_{(G,q,g,h)}(m, r). For simplicity we use the notation E(m, r) instead of E_{(G,q,g,h)}(m, r) in the rest of this paper.

In probabilistic encryption schemes, there are many encryptions for each message. ElGamal allows an operation that takes one encrypted message as input and outputs another encrypted message of the same clear message. This is called the rerandomization operation. For instance, taking the encrypted message (C_1, C_2) = (g^r, m · h^r) as input, one can do the rerandomization and obtain another ciphertext of m as

(C_1', C_2') = (C_1 · g^s, C_2 · h^s) = (g^{r+s}, m · h^{r+s}).

In this paper, messages are randomized by both parties. Therefore, no individual party can decrypt any message by itself. However, each party can partially decrypt a message [32]. When the message is partially decrypted by both parties, it is fully decrypted.

III. PRIVACY PRESERVING NEURAL NETWORK LEARNING

In this section, we present a privacy-preserving distributed algorithm for training neural networks with the back-propagation algorithm. A privacy preserving testing algorithm can be easily derived from the feed-forward part of the privacy-preserving training algorithm.

Our algorithm is composed of many smaller private computations. We will look into them in detail after first giving an overview.

A. Privacy preserving neural network training algorithm

Here we build the privacy preserving distributed algorithm for the neural network training process under the assumption that we already have the algorithm to securely compute the piecewise linear function (Algorithm 2) and the algorithm to securely compute the product of two numbers held by two parties (Algorithm 3). We will explain the two component algorithms in detail later.

For each training iteration, the input of the privacy-preserving back-propagation training algorithm is <{x_A, x_B}, t(x)>, where x_A is held by party A, while x_B is held by party B. t(x) is the target vector of the current training data and it is known to both parties. The output of the algorithm is the connection weights of the output layer and the hidden layer, i.e., {w^o_{ij}, w^h_{jk} | ∀k ∈ {1, 2, ..., a}, ∀j ∈ {1, 2, ..., b}, ∀i ∈ {1, 2, ..., c}}.

The main idea of the algorithm is to secure each step in the non-privacy-preserving back-propagation algorithm, with two stages, feeding forward and back-propagation. In each step, neither the input data from the other party nor the intermediate results can be revealed. In particular, we apply Algorithm 2 to securely compute the sigmoid function, and Algorithm 3 is used to guarantee privacy preserving product computation.

To hide the intermediate results such as the values of hidden layer nodes, the two parties randomly share each result so that neither of the two parties can infer the original data information from the intermediate results. Here by "randomly share", we mean that each party holds a random number and the sum of the two random numbers equals the intermediate result. Note that with intermediate results randomly shared between the two parties, the learning process can still securely carry on to produce the correct learning result (see the correctness analysis in Section III-A.1).

After the entire process of private training, without revealing any raw data to each other, the two parties jointly establish a neural network representing the properties of the union dataset. Our training algorithm for back-propagation neural networks can be summarized as in Algorithm 1. For clarity of presentation, in Algorithm 1, we separate the procedure to compute Σ_i [−(t_i − o_i) w^o_{ij}] h_j (1 − h_j) and explain it in Algorithm 1.1.

1) Correctness: We show that if party A and party B follow Algorithm 1, they will jointly derive the correct weight update result in each learning round, with each of them holding a random share.

For any output layer weight w^o_{ij}, in step (2.1) of Algorithm 1 we have

Δ_1 w^o_{ij} + Δ_2 w^o_{ij}
  = (o_{i1} − t_i) h_{j1} + r_{11} + r_{21} + (o_{i2} − t_i) h_{j2} + r_{12} + r_{22}
  = (o_{i1} − t_i) h_{j1} + (o_{i2} − t_i) h_{j2} + h_{j1} o_{i2} + h_{j2} o_{i1}
  = −(t_i − o_{i1} − o_{i2}) (h_{j1} + h_{j2})
  = −(t_i − o_i) h_j
  = ∂e / ∂w^o_{ij}
  = Δw^o_{ij}.

Therefore the additive random shares of the two parties add up to the correct value for the output layer connection weights.

Now we show that the correctness is also guaranteed for hidden layer weights. In step (2.2) of Algorithm 1, it is easy to get that no matter which party holds the attribute x_k, we always have
Δ_1 w^h_{jk} + Δ_2 w^h_{jk}
  = x_k q_1 + x_k q_2
  = (s_1 p_1 + r_{41} + r_{51} + s_2 p_2 + r_{42} + r_{52}) x_k
  = (s_1 + s_2)(p_1 + p_2) x_k
  = (Σ_i (−t_i + o_{i1}) w^o_{ij} + Σ_i o_{i2} w^o_{ij}) (h_{j1} − h_{j1}^2 − 2r_{31} + h_{j2} − h_{j2}^2 − 2r_{32}) x_k
  = Σ_i [−(t_i − o_i) w^o_{ij}] (h_{j1} + h_{j2}) (1 − h_{j1} − h_{j2}) x_k
  = Σ_i [−(t_i − o_i) w^o_{ij}] h_j (1 − h_j) x_k

Algorithm 1 Privacy preserving distributed algorithm for back-propagation training

Initialize all weights to small random numbers, and make them known to both parties.
Repeat
  for all training samples <{x_A, x_B}, t(x)> do
    Step 1: feed-forward stage
    (1.1) For each hidden layer node h_j, party A computes Σ_{k ≤ m_A} w^h_{jk} x_k, and party B computes Σ_{m_A < k ≤ m_A + m_B} w^h_{jk} x_k.
    (1.2) Using Algorithm 2, party A and B jointly compute the sigmoid function for each hidden layer node h_j, obtaining the random shares h_{j1} and h_{j2} respectively, s.t. h_{j1} + h_{j2} = f(Σ_k w^h_{jk} x_k).
    (1.3) For each output layer node o_i, party A computes o_{i1} = Σ_j w^o_{ij} h_{j1} and party B computes o_{i2} = Σ_j w^o_{ij} h_{j2}, s.t. o_i = o_{i1} + o_{i2} = Σ_j w^o_{ij} h_{j1} + Σ_j w^o_{ij} h_{j2}.
    Step 2: back-propagation stage
    (2.1) For each output layer weight w^o_{ij}:
      Party A and B apply Algorithm 3 to securely compute the product h_{j1} o_{i2}, obtaining random shares r_{11} and r_{12} respectively, s.t. r_{11} + r_{12} = h_{j1} o_{i2}. Similarly they compute the random partitions of h_{j2} o_{i1}: r_{21} and r_{22}, s.t. r_{21} + r_{22} = h_{j2} o_{i1}.
      Party A computes Δ_1 w^o_{ij} = (o_{i1} − t_i) h_{j1} + r_{11} + r_{21} and B computes Δ_2 w^o_{ij} = (o_{i2} − t_i) h_{j2} + r_{12} + r_{22}.
    (2.2) For each hidden layer weight w^h_{jk}:
      Using Algorithm 1.1, party A and B jointly com-

Algorithm 1.1 Securely computing Σ_i [−(t_i − o_i) w^o_{ij}] h_j (1 − h_j)

Input: h_{j1}, o_{i1} (h_{j2}, o_{i2} resp.) for party A (party B resp.), obtained inside Algorithm 1.
Output: random shares q_1, q_2 for A and B resp.

1. Using Algorithm 3, party A and B get random shares of h_{j1} h_{j2}, r_{31} and r_{32} respectively, s.t. h_{j1} h_{j2} = r_{31} + r_{32}.
2. For clarity, we name the intermediate results as h_{j1} − h_{j1}^2 − 2r_{31} = p_1, Σ_i (−t_i + o_{i1}) w^o_{ij} = s_1, h_{j2} − h_{j2}^2 − 2r_{32} = p_2 and Σ_i o_{i2} w^o_{ij} = s_2.
3. Party A computes p_1 and s_1; B computes p_2 and s_2.
4. Applying Algorithm 3, party A gets r_{41} and r_{51}, party B gets r_{42} and r_{52}, s.t. r_{41} + r_{42} = s_1 p_2, r_{51} + r_{52} = s_2 p_1.
5. Name q_1 = s_1 p_1 + r_{41} + r_{51} and q_2 = s_2 p_2 + r_{42} + r_{52}. Party A computes q_1 and party B computes q_2 locally.
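The recombination identities used in Algorithm 1.1 and in the correctness argument can be checked numerically. The sketch below draws arbitrary random values for the shares of h_j and o_i (one output node i and one hidden node j, values purely illustrative) and verifies that the randomly shared pieces add up to −(t_i − o_i)h_j and h_j(1 − h_j).

```python
import random

random.seed(7)
t_i = random.random()                            # target, known to both parties
h_j1, h_j2 = random.random(), random.random()    # shares: h_j = h_j1 + h_j2
o_i1, o_i2 = random.random(), random.random()    # shares: o_i = o_i1 + o_i2
h_j, o_i = h_j1 + h_j2, o_i1 + o_i2

# Step (2.1): random shares of the cross products h_j1*o_i2 and h_j2*o_i1
r11 = random.random(); r12 = h_j1 * o_i2 - r11
r21 = random.random(); r22 = h_j2 * o_i1 - r21
d1 = (o_i1 - t_i) * h_j1 + r11 + r21             # party A's share of the delta
d2 = (o_i2 - t_i) * h_j2 + r12 + r22             # party B's share of the delta
assert abs((d1 + d2) - (-(t_i - o_i) * h_j)) < 1e-9

# Algorithm 1.1, steps 1-3: shares of h_j1*h_j2, then p_1 + p_2 = h_j(1 - h_j)
r31 = random.random(); r32 = h_j1 * h_j2 - r31
p1 = h_j1 - h_j1 ** 2 - 2 * r31
p2 = h_j2 - h_j2 ** 2 - 2 * r32
assert abs((p1 + p2) - h_j * (1.0 - h_j)) < 1e-9
```

Neither assertion depends on the particular random values drawn, which is exactly the point of the correctness argument: the shares always recombine to the plain back-propagation quantities.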
sigmoid function, but since in our paper the result of the sigmoid function is only an intermediate result for the whole learning process, here we keep it randomly shared.

For ease of presentation, we write the algorithm under the assumption that x_1 and x_2 are integers. Note that we can easily rewrite the algorithm to allow real numbers with a precision of a few digits after the dot. Actually the algorithm for fractional numbers is the same in essence, given that we can shift the floating point to the right to get integer numbers; in other words, integers and fractions are easy to exchange with each other in binary representation.

After the piecewise linear approximation, as the input is split between the two parties, neither party even knows which linear function to use, without the knowledge of which range the input falls in. Our main idea is to let one party compute all possible values according to her own input, and among those values, the other party picks the result she is looking for.

As shown in Algorithm 2, party A first computes the sigmoid function values for the different possible inputs of B. After subtracting a same random number R from each of the function values, A encrypts each of them (step 1). Actually the random number generated by A is the final output for A, the random share of the sigmoid function value. The random share for B is one of the clear texts hidden behind the ElGamal scheme. So the remaining task of this algorithm is to let B obtain that clear text, y(x_1 + x_2) − R, which corresponds to her input, without revealing the input of either party to the other. As A does not know the input of B, she sends all the encrypted results to B and lets B pick the one based on her own input. Here, since all the results are encrypted, B cannot get any information about A's original data either. Because A could get to know the value of x_2 by comparing the encrypted message chosen by B with all those generated by herself, a rerandomization is conducted by B before B sends back her choice to A, to protect B's data x_2. After A and B sequentially decrypt, B gets her share of the final sigmoid function value (steps 3 and 4).

Algorithm 2 Securely computing the piecewise linear sigmoid function

Step 1: Party A generates a random number R and computes y(x_1 + i) − R for each i, s.t. −n < i ≤ n. Define m_i = y(x_1 + i) − R. Party A encrypts each m_i using the ElGamal scheme and gets E(m_i, r_i), where each r_i is a new random number. Party A sends each E(m_i, r_i) in the increasing order of i.
Step 2: Party B picks E(m_{x_2}, r_{x_2}). She rerandomizes it and sends E(m_{x_2}, r') back to A, where r' = r_{x_2} + s, and s is only known to party B.
Step 3: Party A partially decrypts E(m_{x_2}, r') and sends the partially decrypted message to B.
Step 4: Party B finally decrypts the message (by doing partial decryption on the already partially decrypted message) to get m_{x_2} = y(x_1 + x_2) − R. Note R is only known to A and m_{x_2} is only known to B. Furthermore, m_{x_2} + R = y(x_1 + x_2) = f(x).

Note that in Step 1 of Algorithm 2, party A must generate a new random number r_i for encrypting each message. Although party B can get h^{r_{x_2}} at the end of the algorithm from E(m_{x_2}, r_{x_2}) and the clear message m_{x_2}, she has no way to get the other numbers h^{r_i}. On the other hand, party A can always conduct the partial decryption without knowing which encrypted message B has chosen, because the decryption of the ElGamal scheme is independent of the random number r_i (please see details in Section II-D).

C. Privacy-preserving distributed algorithm for computing a product

To securely compute a product, some existing secure multi-party computation protocols for the dot product can be utilized (e.g., [33], [34]) by taking integer numbers as a special form of vector input. Here we provide another option for privacy preserving product computation, stated as Algorithm 3. The advantage of Algorithm 3 is that it can be very efficient for some applications, when the finite field of the input is small.

Before applying Algorithm 3, a pre-processing step is needed to convert the input of each party into a small integer. Assume party A holds integer M and party B holds integer N, where both M and N are in the same range −n < M ≤ n, −n < N ≤ n (recall that n is a small integer). Basically Algorithm 3 follows a similar idea as Algorithm 2. After running Algorithm 3, A and B get random numbers respectively which sum to M · N.

Algorithm 3 Securely computing the product of two integers

Step 1: Party A generates a random number R and computes M · i − R for each i, s.t. −n < i ≤ n. Name m_i = M · i − R. A encrypts each m_i using the ElGamal scheme and gets E(m_i, r_i), where each r_i is a new random number. Then party A sends each E(m_i, r_i) to party B in the increasing order of i.
Step 2: Party B picks E(m_N, r_N). She rerandomizes it and sends E(m_N, r') back to A, where r' = r_N + s, and s is only known to party B.
Step 3: Party A partially decrypts E(m_N, r') and sends the partially decrypted message to B.
Step 4: Party B finally decrypts the message (by doing partial decryption on the already partially decrypted message) to get m_N = M · N − R. Note R is only known to A and m_N is only known to B. Furthermore, m_N + R = M · N.

IV. SECURITY ANALYSIS

In this section, we explain why our algorithms are secure in the semi-honest model. Recall that in the semi-honest model, the parties follow the protocol but each may try to analyze what she can see during the protocol execution. To guarantee security in the semi-honest model, we must show that the parties can learn nothing beyond their outputs from the information they get
throughout the protocol process⁴. A standard way to show this is to construct a simulator which can simulate what the party can see in the protocol, given only the input and output of the protocol for this party.

Since the privacy preserving back-propagation training (Algorithm 1) uses Algorithm 2 and Algorithm 3 as building blocks, we first analyze the security of Algorithm 2 and Algorithm 3, and then we conduct a security analysis of Algorithm 1 from the viewpoint of the overall security level.

A. Security of Algorithm 2 and Algorithm 3

Now we consider the back-propagation stage. In step (2.1) there are two calls of Algorithm 3 for each hidden-output connection. Time for one multiplication and three additions is also needed. Because multiplication time is much more significant than addition time, for simplicity, we also use S to denote the time for one multiplication and three additions. In step (2.2), Algorithm 3 is called 4 times, and thus the time for multiplications and additions is (c + 4)S. The total execution time for back-propagation is bc(2T + S) + ab(4T + 4S + cS) = (2bc + 4ab)T + (bc + 4ab + abc)S. Combining the time for the two stages, we obtain the running time of one round of privacy preserving back-propagation learning, (5ab + 2bc + abc)S + (2bc + 4ab + b)T.

3) Communication overhead: In Algorithms 2 and 3, 2n + 2 messages are passed between party A and party B. With each message being s bits long, the communication overhead is (2n + 2) × s. The total communication overhead of one round of learning in Algorithm 1 is the overhead caused by calling Algorithms 2 and 3, plus 2 × s in step 3, which is (b + 2bc + 4ab)(2n + 2)s + 2s.

B. Analysis of accuracy loss

There are two places in our algorithms where we introduce approximation for the goal of privacy. One is that the sigmoid function used in the neuron computation is replaced by a piecewise linear function. The other approximation is introduced by mapping the real numbers to fixed-point representations to enable the cryptographic operations in Algorithm 2 and Algorithm 3. This is necessary in that intermediate results, for example the values of neurons, are represented as real numbers in normal neural network learning, but cryptographic operations are on discrete finite fields. We will empirically evaluate the impact of these two sources of approximation on the accuracy loss of our neural network learning algorithm in Section VI. Below we give a brief theoretical analysis of the accuracy loss caused by the fixed-point representations.

1) Error in truncation: Suppose that the system in which the neural network is implemented uses µ bits for the representation of real numbers. Recall that before applying Algorithm 2 and Algorithm 3, we preprocess the input numbers into a finite field that is suitable for cryptographic operations. Assume that we truncate the µ-bit numbers by chopping off the lowest ν bits and leave the new lowest order bit unchanged. The precision error ratio can be bounded by ε = 2^{µ−ν}.

2) Error in feed-forward stage: In the feed-forward stage of Algorithm 1, since only Algorithm 2 is applied once, the error ratio bound introduced by the number conversion for cryptographic operations is ε.

3) Error in output-layer delta: In Step (2.1) of Algorithm 1, Δ_1 w^o_{ij} = (o_{i1} − t_i) h_{j1} + r_{11} + r_{21}, in which o_{i1} and h_{j1} are already approximated in preceding operations. Therefore, the error ratio bound for Δ_1 w^o_{ij} is (1 + ε)^2 − 1. We can obtain the same result for Δ_2 w^o_{ij}.

4) Error in hidden-layer delta: In Step (2.2) of Algorithm 1, Algorithm 3 is applied 4 times sequentially. The error ratio bound for Δ_1 w^h_{jk} and Δ_2 w^h_{jk} is (1 + ε)^4 − 1.

5) Error in weight update: In Step (3) of Algorithm 1, the update of weights introduces no successive error.

since weights are output of the training algorithm and η is known to both parties as input, by the weight update rule w^o_{ij} ← w^o_{ij} − η(Δ_1 w^o_{ij} + Δ_2 w^o_{ij}), Δ_1 w^o_{ij} can be simulated for party B. The simulation of Δ_1 w^h_{jk} for party B is likewise. Similarly, we can construct the simulator for party A to simulate Δ_2 w^o_{ij} and Δ_2 w^h_{jk}.

Note that, in the above, the simulators for Algorithm 1 integrate the simulators for Algorithms 2 and 3 when the two component algorithms are called, together with the simulations of α_1, α_2, β_1, β_2 in step 3. This completes the construction of simulators for Algorithm 1.

TABLE I
DATASETS AND PARAMETERS

Dataset    Samples  Classes  Architecture  Epochs  Learning Rate
kr-vs-kp   3196     2        36-15-1       20      0.1
Iris       150      3        4-5-3         80      0.1
diabetes   768      2        8-12-1        40      0.2
Sonar      104      2        60-6-2        150     0.1
Landsat    6435     6        36-3-6        12      0.1

VI. EVALUATION

In this section, we perform experiments to measure the accuracy and overheads of our algorithms. We have two sets of experiments on the accuracy. In the first set, we compare the testing error rates in the privacy-preserving and non-privacy-preserving cases. In the second set, we distinguish the two types of approximation introduced by our algorithms, piecewise linear approximation of the sigmoid function and conversion of real numbers to fixed-point numbers when applying cryptographic algorithms, and analyze how they affect the accuracy of the back-propagation neural network model. Our experiments on overheads cover the computation and communication costs as well as comparisons with an alternative solution using general purpose secure computation.

A. Setup

The algorithms are implemented in C++ and compiled with g++ version 3.2.3. The experiments were executed on a Linux (Red Hat 7.1) workstation with dual 1.6 GHz Intel processors and 1 GB of memory. The programs used the GNU Multiple Precision Arithmetic Library in the implementation of the ElGamal scheme. The results shown below are the average of 100 runs.

The testing datasets are from the UCI dataset repository [36]. We choose a variety of datasets, kr-vs-kp, Iris, Pima-indian-diabetes (diabetes), Sonar and Landsat, with different characteristics in the number of features, the number of labeled classes, the size of the datasets and the data distributions. Different neural network models are chosen for the varying datasets. Table I shows the architecture and training parameters used in our neural network models. We choose the number of hidden nodes based on the number of input and output nodes. This choice is based on the criteria of having at least one hidden unit per output, at least one hidden unit for every ten inputs, and five hidden units being a minimum. Weights are initialized as uniformly random values in the range [−0.1, 0.1]. Feature values in each dataset are normalized between 0 and 1. The privacy preserving back-propagation neural networks have the same parameters as the non-privacy-preserving version. For the privacy preserving version, we use a key length of 512 bits.

B. Accuracy loss

First we measure the accuracy loss of our privacy preserving algorithms when the neural network is trained with a fixed number of epochs (shown in Table I). The number of epochs is set based both on the number of examples and on the parameters (i.e., topology) of the network. Specifically, we use 80 epochs for small problems involving fewer than 250 examples; 40 epochs for the mid-sized problems containing
9
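The fixed-point truncation step and the compounding of per-operation error ratios analyzed earlier in this section can be illustrated with a small sketch. All parameter values and helper names below (the 20 fractional bits, the 8 truncated bits, the per-step ratio `eps`) are assumptions chosen for illustration, not the parameters of our implementation.

```python
# Illustrative sketch: a real number is mapped to a fixed-point integer,
# the lowest nu bits are chopped off, and the relative error introduced
# is compared with an assumed per-step error ratio eps.

FRAC_BITS = 20   # assumed fractional bits in the fixed-point representation
NU = 8           # assumed number of low-order bits chopped off

def to_fixed(x):
    """Map a real number to a fixed-point integer with FRAC_BITS fractional bits."""
    return int(round(x * (1 << FRAC_BITS)))

def from_fixed(xf):
    """Map a fixed-point integer back to a real number."""
    return xf / (1 << FRAC_BITS)

def truncate_low_bits(xf, nu=NU):
    """Chop off the lowest nu bits, leaving the new lowest-order bit unchanged."""
    return (xf >> nu) << nu

x = 0.731058                          # e.g., a sigmoid output
xt = truncate_low_bits(to_fixed(x))
rel_err = abs(from_fixed(xt) - x) / x

# Successive approximate operations compound: k operations, each within a
# ratio eps of the exact value, stay within (1 + eps)**k - 1 overall, which
# is the shape of the (1 + eps)^2 - 1 and (1 + eps)^4 - 1 bounds above.
eps = 2.0 ** (NU - FRAC_BITS)         # assumed per-step error ratio
assert rel_err <= eps
assert (1 + eps) ** 2 - 1 < (1 + eps) ** 4 - 1
```

The compounded bound grows roughly linearly in the number of sequential approximate operations when eps is small, which is why the hidden-layer delta (four applications of Algorithm 3) carries a larger bound than the output-layer delta.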
Mid-sized problems are those containing between 250 and 500 examples; for larger problems we use 20 epochs.

Table II shows testing-set error rates for both the non-privacy-preserving back-propagation neural network and the privacy-preserving back-propagation neural networks. Since the numbers of training epochs are fixed, the global error minimum may not be achieved when the training ends. That explains the relatively high error rates for both privacy-preserving training and the non-privacy-preserving case. From Table II, we can see that across the different datasets, the increase in test error rates caused by the privacy-preserving algorithm remains in a small range, from 1.26% for Landsat to 5.17% for Iris.

We extend the experiments by varying the number of epochs and evaluate the accuracy of privacy-preserving back-propagation neural networks over different numbers of training epochs. For clarity of presentation, we only show the results for the Iris and kr-vs-kp datasets in Figure 1. The results for the other datasets have similar characteristics but with different epoch scales. From Figure 1, we can see clearly that the error rates decrease as the number of epochs increases. Furthermore, the error rates of the privacy-preserving back-propagation network decrease faster than those of the standard algorithm, which means that increasing the number of epochs can help to reduce the accuracy loss.

Fig. 1. Error Rates on Training Epochs. (Error rate (%) versus number of epochs, 0 to 800, for Iris and kr-vs-kp, each with and without privacy preservation.)

C. Effects of two types of approximation on accuracy

In this set of experiments, we aim to analyze the causes of accuracy loss in our privacy-preserving neural network learning algorithms. Recall that we have two types of accuracy loss, introduced by sigmoid-function approximation and by mapping real numbers to a fixed-point representation, respectively. We distinguish and evaluate the effects of these two approximation types by performing back-propagation learning with the approximated piecewise-linear sigmoid function but without cryptographic operations (we call this the sigmoid approximation test). This eliminates the security of the learning, but the purpose of the modification is to measure the two kinds of accuracy loss separately.

Table III compares the training and testing error rates of back-propagation learning without privacy concerns against the sigmoid approximation test and against privacy-preserving learning with both types of approximation. In this set of experiments, the training process stops when the error is less than the error tolerance threshold of 0.1; if the number of epochs reaches the maximum of 1000 before converging, the training also stops. We call the latter case a failure case of training.

From Table III, we can observe that both approximation types cause a certain amount of accuracy loss. The approximation introduced by the piecewise-linear sigmoid function is less significant than that introduced by converting real numbers to fixed-point numbers. For example, in testing on the Iris dataset, sigmoid-function approximation contributes only 0.70% of the 2.96% total accuracy loss, while the conversion of real numbers to fixed-point numbers causes the remaining 2.26%.

D. Computation and Communication overhead

We now examine the computation and communication overhead of our algorithm to verify that it is lightweight and practical. In particular, we measure and record the overall time for privacy computation and the time for communication between party A and party B over the entire training process.

Since we are the first to study privacy-preserving neural network learning with vertically partitioned data, no other complete algorithm is currently available for this problem. Consequently, in order to demonstrate that our algorithm is efficient among possible cryptographic privacy-preserving solutions, we consider a modification of Algorithm 1 in which we replace the calls to Algorithms 2 and 3 with executions of Yao's general-purpose two-party secure computation protocol [9]. (We use the Fairplay secure function evaluation system [37] in our implementation of Yao's protocol.) In the setting described in Section VI-A, we compare our original Algorithm 1 with this modified algorithm in terms of computation and communication overheads.

Table IV shows the computation and communication overhead measurements (in minutes) of our algorithm and of the modified algorithm.
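The speedup ranges reported in this section can be recomputed directly from the Table IV measurements. A minimal cross-check sketch; the dictionary layout and variable names are illustrative, while the numbers are the Table IV values (in minutes):

```python
# Cross-check of the reported speedups, recomputed from Table IV:
# dataset -> (our comp., our comm., modified comp., modified comm.)
table_iv = {
    "kr-vs-kp": (63.49, 39.20, 1778.29, 443.43),
    "Iris":     (10.94,  6.23,  310.62,  75.19),
    "diabetes": (24.19, 14.24,  628.32, 169.33),
    "Sonar":   (317.43, 197.37, 8868.04, 2193.13),
    "Landsat":  ( 8.14,  4.92,  211.92,  56.98),
}

# Speedup of our algorithm over the Yao-based modified algorithm, per dataset.
comp_speedup = [mc / c for c, _, mc, _ in table_iv.values()]
comm_speedup = [mm / m for _, m, _, mm in table_iv.values()]

# Computation speedup spans roughly 25.97x to 28.39x,
# communication speedup roughly 11.11x to 12.06x.
assert 25.9 < min(comp_speedup) and max(comp_speedup) < 28.4
assert 11.1 < min(comm_speedup) and max(comm_speedup) < 12.1
```

The smallest computation speedup comes from the diabetes dataset and the largest from Iris, matching the 25.97–28.39 range quoted in the text.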
It is clear that, in experiments on all five datasets, our algorithm is significantly more efficient than the algorithm integrated with Yao's protocol. More precisely, our algorithm is 25.97 to 28.39 times more efficient in terms of computation overhead, and 11.11 to 12.06 times more efficient in terms of communication overhead.

TABLE IV
COMPUTATION OVERHEAD AND COMMUNICATION OVERHEAD (IN MINUTES)

            Our Algorithm        Modified Algorithm
Dataset     Comp.     Comm.      Comp.       Comm.
kr-vs-kp    63.49     39.20      1778.29     443.43
Iris        10.94      6.23       310.62      75.19
diabetes    24.19     14.24       628.32     169.33
Sonar      317.43    197.37      8868.04    2193.13
Landsat      8.14      4.92       211.92      56.98

Now we compare the computation and communication overhead of our work with [27] and [28]. As we have mentioned, the objectives of [27] and [28] are different from ours; moreover, the authors of [27] and [28] have not implemented their protocols, nor have they performed any experiments. Consequently, in order to compare their overhead with ours, we have to implement these protocols ourselves and measure their overheads.

Nevertheless, for practical purposes, we do not implement and measure the complete protocols of [27] and [28]. The reason is that these protocols are so slow in practice that even parts of them incur significantly more overhead than our entire algorithm, so measuring the overhead of such parts is sufficient to demonstrate the better efficiency of our algorithm. Furthermore, if we measured the overheads of the complete protocols, [27] and [28] (especially [27]) would take more time than we can afford. Hence, we have implemented a key component of both [27] and [28], namely secure evaluation of the activation function. When we measure the overheads of [27] and [28], we count the cumulative overhead of the secure activation-function evaluations only, and ignore the overhead of all other operations. We compare this measured partial overhead of [27] and [28] with the total overhead of our own algorithm, which includes the computation and communication overheads of all components.

Table V shows the comparison results when we train the neural network on the kr-vs-kp dataset. We put the symbol ">" before the measured partial overheads of [27] and [28] to emphasize that these are not the total overheads. As we can see, for both [27] and [28], the measured partial overheads are already significantly greater than the total overhead of our protocol, in terms of both computation and communication. Therefore, we can safely claim that our algorithm is more efficient than the protocols in [27] and [28].

TABLE V
COMPUTATION OVERHEAD AND COMMUNICATION OVERHEAD COMPARISON WITH [27] AND [28] (IN MINUTES)

Overhead        Our Algorithm    Protocol in [27]    Protocol in [28]
Computation     63.49            > 1057.10           > 111.65
Communication   39.20            > 219.24            > 71.38

VII. CONCLUSION AND FUTURE WORK

In this paper, we present a privacy-preserving algorithm for back-propagation neural network learning. The algorithm guarantees privacy in a standard cryptographic model, the semi-honest model. Although approximations are introduced in the algorithm, experiments on real-world data show that the amount of accuracy loss is reasonable.

Using our techniques, it should not be difficult to develop privacy-preserving algorithms for back-propagation network learning with three or more participants. In this paper we have considered only the back-propagation neural network; a future research topic is to extend our work to other types of neural network training.

REFERENCES

[1] HIPAA, National Standards to Protect the Privacy of Personal Health Information, http://www.hhs.gov/ocr/hipaa/finalreg.html.
[2] M. Chicurel, Databasing the brain, Nature, vol. 406, pp. 822-825, 24 August 2000.
[3] Editorial, Whose scans are they, anyway?, Nature, vol. 406, p. 443, 3 August 2000.
[4] D. E. Rumelhart, B. Widrow, and M. A. Lehr, The basic ideas in neural networks, Communications of the ACM, vol. 37, pp. 87-92, 1994.
[5] R. P. Lippmann, An introduction to computing with neural networks, IEEE Acoustics, Speech and Signal Processing Magazine, vol. 4, pp. 4-22, 1987.
[6] L. Wan, W. K. Ng, S. Han, and V. C. S. Lee, Privacy-preservation for gradient descent methods, in Proc. of ACM SIGKDD, 2007, pp. 775-783.
[7] O. Goldreich, Foundations of Cryptography, Volumes 1 and 2, Cambridge University Press, 2001-2004.
[8] D. J. Myers and R. A. Hutchinson, Efficient implementation of piecewise linear activation function for digital VLSI neural networks, Electronics Letters, 1989, pp. 1662-1663.
[9] A. Yao, How to generate and exchange secrets, in Proc. of the 27th IEEE Symposium on Foundations of Computer Science, 1986, pp. 162-167.
[10] T. ElGamal, A public-key cryptosystem and a signature scheme based on discrete logarithms, IEEE Trans. Information Theory, vol. IT-31, no. 4, pp. 469-472, 1985.
[11] D. Agrawal and R. Srikant, Privacy preserving data mining, in Proc. ACM SIGMOD, 2000, pp. 439-450.
[12] Y. Lindell and B. Pinkas, Privacy preserving data mining, in Proc. of the 20th Annual International Cryptology Conference on Advances in Cryptology (CRYPTO 00), vol. 1880 of Lecture Notes in Computer Science, pp. 36-44, Santa Barbara, Calif., USA, August 2000.
[13] N. Zhang, S. Wang, and W. Zhao, A new scheme on privacy-preserving data classification, in Proc. of ACM SIGKDD, 2005, pp. 374-383.
[14] G. Jagannathan and R. N. Wright, Privacy-preserving distributed k-means clustering over arbitrarily partitioned data, in Proc. ACM SIGKDD, 2005, pp. 593-599.
[15] J. Vaidya and C. Clifton, Privacy-preserving k-means clustering over vertically partitioned data, in Proc. ACM SIGKDD, 2003, pp. 206-215.
[16] A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke, Privacy preserving mining of association rules, in Proc. ACM SIGKDD, 2002, pp. 217-228.
[17] Y. Lindell and B. Pinkas, Privacy preserving data mining, Journal of Cryptology, vol. 15, no. 3, pp. 177-206, 2002.
[18] W. Du and Z. Zhan, Using randomized response techniques for privacy-preserving data mining, in Proc. of ACM SIGKDD, 2003, pp. 505-510.
[19] J. Vaidya and C. Clifton, Privacy preserving naive Bayes classifier for vertically partitioned data, in Proc. SIAM International Conference on Data Mining, 2004.
[20] R. Wright and Z. Yang, Privacy-preserving Bayesian network structure computation on distributed heterogeneous data, in Proc. ACM SIGKDD, 2004, pp. 713-718.
[21] K. Chen and L. Liu, Privacy preserving data classification with rotation perturbation, in Proc. of ICDM, 2005, pp. 589-592.
Tingting Chen received her B.S. and M.S. degrees in computer science from the Department of Computer Science and Technology, Harbin Institute of Technology, China, in 2004 and 2006, respectively. She is currently a Ph.D. candidate in the Department of Computer Science and Engineering, State University of New York at Buffalo, USA. Her research interests include data privacy and economic incentives in wireless networks.