This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2019.2901840, IEEE Internet of Things Journal
Abstract—Machine learning (ML) techniques have been widely used in many smart city sectors, where a huge amount of data is gathered from various IoT devices. As a typical ML model, support vector machine (SVM) enables efficient data classification and thereby finds its applications in real-world scenarios such as disease diagnosis and anomaly detection. Training an SVM classifier usually requires a collection of labelled IoT data from multiple entities, raising great concerns about data privacy. Most of the existing solutions rely on an implicit assumption that the training data can be reliably collected from multiple data providers, which is often not the case in reality.
To bridge the gap between ideal assumptions and realistic constraints, in this paper we propose secureSVM, a privacy-preserving SVM training scheme over blockchain-based encrypted IoT data. We utilize blockchain techniques to build a secure and reliable data sharing platform among multiple data providers, where IoT data is encrypted and then recorded on a distributed ledger. We design secure building blocks, such as secure polynomial multiplication and secure comparison, by employing the Paillier homomorphic cryptosystem, and construct a secure SVM training algorithm that requires only two interactions in a single iteration, with no need for a trusted third party. Rigorous security analysis proves that the proposed scheme ensures the confidentiality of the sensitive data of each data provider as well as of the SVM model parameters of data analysts. Extensive experiments demonstrate the efficiency of the proposed scheme.

Index Terms—Privacy protection, encrypted IoT data, machine learning, blockchain, homomorphic cryptosystem

This work is partially supported by the National Key Research and Development Program of China under Grant 2018YFB0803405, the National Natural Science Foundation of China under Grants 61602039 and 61872041, the Beijing Natural Science Foundation under Grant 4192050, and CCF-Tencent Open Fund WeBank Special Funding.
M. Shen, X. Tang and L. Zhu are with the School of Computer Science, Beijing Institute of Technology, Beijing 100081, China (e-mail: shenmeng@bit.edu.cn, tangguotxy@163.com, liehuangz@bit.edu.cn). Prof. Zhu is the corresponding author.
X. Du is with the Department of Computer and Information Sciences, Temple University, Philadelphia, PA 19122, USA (e-mail: dxj@ieee.org).
M. Guizani is with the Department of Computer Science and Engineering, Qatar University, Qatar (e-mail: mguizani@ieee.org).
Copyright © 2012 IEEE. Personal use of this material is permitted. However, permission to use this material for any other purposes must be obtained from the IEEE by sending a request to pubs-permissions@ieee.org.
2327-4662 © 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.

I. INTRODUCTION

In recent years, smart cities have been incorporating more and more advanced Internet-of-Things (IoT) infrastructures, resulting in a huge amount of data gathered from various IoT devices deployed in many city sectors, such as transportation, manufacturing, energy transmission, and agriculture [1]. In order to deal with the challenges arising from the processing requirements of IoT data, an increasing number of innovations driven by machine learning (ML) technology have been proposed.

Among all ML models, support vector machine (SVM) is a prominent supervised learning model that can efficiently perform data classification. Thus, SVM is adopted in many domains to solve real-world classification problems in IoT-enabled smart cities. In the scenario of personal healthcare, fitness records monitored by wearable IoT sensors can be fed to SVM classifiers for accurate diagnosis. In the domain of network intrusion detection, SVM classifiers can be used to identify anomalies from a series of traffic data derived from communications among IoT devices [2].

The construction of supervised ML classifiers (e.g., SVM) is known as the training phase, which trains a specific classifier from a set of labelled samples. It has been shown that the performance of ML classifiers increases as the order of magnitude of training data grows [3, 4]. Since the training dataset owned by a single entity (e.g., a hospital or a network provider) is usually limited in terms of data volume and variety, there has long been a need for an efficient mechanism to train ML classifiers using a combination of datasets gathered from multiple entities.

However, different entities are usually reluctant to share their data for training due to concerns about data privacy, integrity, and ownership. First, many training tasks handle sensitive data (e.g., clinical data recorded by medical IoT devices), which may result in leakage of sensitive and confidential information during the training process. Second, data records might be tampered with or modified without authorization by potential attackers during the sharing process, making the resulting ML classifiers inaccurate. Finally, data providers can lose control of their data, as shared datasets are accessible to all participants and can be freely replicated by others.

The data privacy issues of training ML classifiers have attracted a significant amount of research attention from both academia and industry. Existing solutions [5–8] resort to differential privacy or cryptographic algorithms to protect the data privacy of each individual provider. They generally rely on an implicit assumption that the training data can be collected reliably from multiple data providers for further analysis, and pay little attention to the concerns of data integrity and ownership. This, however, is not always the case in reality due to potential attacks. To bridge the gap between ideal assumptions and realistic constraints, in this paper, we
exploit blockchain techniques to build a secure data sharing platform. Blockchain is essentially a distributed filing system designed to allow the sharing of tamper-proof records among multiple parties [9]. The permanent and immutable records on blockchain enable data auditing, which can be used to confirm the ownership of data records. This also facilitates quantification of the contribution of individual data providers and the design of incentive strategies to encourage sharing of training data.

Although promising, it is still a challenging task to incorporate blockchain into the ML training process. The first challenge is designing an appropriate training data format that can be easily accommodated on blockchain while protecting the data privacy of each individual provider. The second challenge lies in the construction of a training algorithm that builds accurate SVM classifiers using the data recorded on blockchain without revealing sensitive information.

To address the above challenges, we propose secureSVM, a privacy-preserving SVM training scheme using blockchain-based encrypted IoT data. secureSVM employs a public-key cryptosystem, Paillier, to protect the privacy of IoT data, where data providers encrypt their data locally with their own private keys. Paillier is an additively homomorphic cryptosystem and is more efficient than other algorithms (e.g., Goldwasser-Micali, RSA and Rabin) in terms of encryption and decryption efficiency [10].

Conventional SVM optimization algorithms for unencrypted data, such as the sequential minimal optimization algorithm, require intensive comparison operations, which result in a huge amount of interactions in the scenario of encrypted data. Therefore, we resort to a relatively simple optimization algorithm, gradient descent, as the optimization algorithm in secureSVM. Since gradient descent still contains basic operations such as polynomial operations and comparisons, we design secure polynomial multiplication and secure comparison by using the homomorphic properties of Paillier. With the above building blocks, an iteration of secureSVM requires only two interactions, which significantly reduces the computation and communication overhead.

The main contributions of this paper are as follows.
• We employ blockchain techniques to enable secure and reliable IoT data sharing. Each IoT data provider can encrypt its data instances locally with its own private key, and then record the encrypted data on blockchain via specially formatted transactions.
• By using the Paillier homomorphic cryptosystem, we design secure building blocks, such as secure polynomial multiplication and secure comparison, and construct a secure SVM training algorithm, which requires only two interactions in a single iteration, with no need for a trusted third party.
• Extensive experiments are conducted to show that our scheme is able to securely train SVM classifiers with relatively high accuracy. Through rigorous security analysis, we show that the proposed scheme ensures the confidentiality of the sensitive data of each data provider as well as of the SVM model parameters of data analysts.

The rest of this paper is organized as follows. We summarize the related work in Section II and describe the preliminaries in Section III. Then, we present the system overview in Section IV and describe the proposed scheme in Section V. After that, we formally analyze the security issues in Section VI and experimentally evaluate the proposed scheme in Section VII. Finally, we conclude this paper in Section VIII.

II. RELATED WORK

In general, supervised learning consists of two phases: the training phase, which learns an ML model from a given set of labelled samples, and the classification phase, which outputs a label with maximum likelihood for a given sample. Accordingly, existing studies on privacy-preserving ML can be broadly classified into two categories, namely privacy-preserving ML training and privacy-preserving ML classification.

A. Privacy-preserving ML training

The training of ML models usually gets multiple parties involved; thus, the privacy goal is to train a model with a collection of the data from all parties while protecting the data of an individual party from being learned by other parties. Our work falls in this category, where plentiful solutions have been proposed during the last decade [6, 11–17].

Differential privacy (DP) is a commonly used technique to protect data privacy in the publishing stage [6]. More specifically, DP ensures the security of the published data by adding carefully-calculated perturbations to the original data. Abadi et al. [6] proposed a DP-based deep learning scheme, which enables multiple parties to jointly learn a neural network while protecting the sensitive information of their datasets.

The DP-based solutions can achieve high computational efficiency, as all calculations are performed over plaintext data. However, the resulting ML models can be inaccurate, as the perturbations inevitably reduce the quality of the training data. In addition, perturbations may not protect the data privacy completely, as a bounded amount of sensitive information about each individual training record is exposed. For instance, a privacy budget parameter is employed in DP to balance data privacy and model accuracy: a smaller budget provides stronger privacy protection at the cost of reduced model accuracy.

In order to achieve better privacy guarantees in ML training, homomorphic cryptosystems (HC) have been introduced to train ML models based on encrypted data. It is the inherent nature of HC that allows computations on ciphertexts while preserving their correctness. Several secure algorithms based on HC have been proposed for training different machine learning models, including SVM [11], logistic regression [12, 13], decision trees [15], and Naive Bayes [14]. The authors in [11] developed protocols for secure addition and subtraction based on Paillier and constructed a private SVM training algorithm. Since some computations are not supported by Paillier, they employ an authorization server, a trusted third party, for computation outsourcing.

Compared with the DP-based solutions, the HC-based ones achieve higher data privacy at the cost of low efficiency. The reasons lie in two aspects: 1) Although fully homomorphic encryption (FHE) enables complex computations (e.g., with
arbitrary combinations of additions and multiplications) on ciphertexts, the current implementations of FHE lead to prohibitively high costs in terms of encryption and computation, making them impractical in real-world applications; and 2) partially homomorphic encryption (PHE) is more practical compared with FHE but only supports a single type of operation (i.e., addition or multiplication). Therefore, in order to enable complex computations, existing solutions usually rely on a trusted third party (e.g., the authorization server [11]) or lead to inaccurate models by approximating complex equations with a single type of computation [16, 17].

B. Privacy-preserving ML classification

In a classification-as-a-service scenario, the data sample to be classified and the ML model usually belong to two different parties. The data owner wants to know the classification result but is reluctant to expose the sensitive data sample to an untrusted model owner to obtain the classification service. Meanwhile, the model owner may also be reluctant to reveal the information of the classification model, as it is a highly valuable asset to the service provider.

To protect the privacy of both sides, efforts have been dedicated to developing efficient solutions [5, 8, 18–20]. Wang et al. [19] proposed a scheme for classifying encrypted images based on multi-layer learning. They rely on an assumption that the image content should be protected, whereas the classifier is not confidential. Zhu et al. [20] proposed a privacy-preserving nonlinear SVM classification scheme for online medical prediagnosis. With their design, both the sensitive information of each individual's health record and the SVM model are protected. Rahulamathavan et al. [8] proposed a privacy-preserving SVM data classification scheme that performs secure classification for both two-class and multi-class problems, where the server is unable to learn any knowledge about clients' input data samples while the server-side classifier is also kept secret from the clients during the classification process. Several works leveraged HC techniques to develop a series of classification protocols for applying typical ML classifiers (e.g., decision trees, hyperplane decision, and Naive Bayes) to encrypted data [5, 18].

These studies explored a set of generic classifiers and building blocks to construct privacy-preserving classification schemes. However, the computation in the training phase of ML classifiers is more complex than that in the classification phase, so building blocks that suffice for classification algorithms can be powerless for complex training algorithms.

C. The novelty of this paper

In this paper, we combine the Paillier cryptosystem and blockchain techniques to address the concerns about data privacy, integrity and ownership when training SVM classifiers using IoT data from different providers. More precisely, the IoT data of each provider is first encrypted with Paillier and then recorded on a distributed ledger. Data analysts who want to train SVM classifiers can get access to the encrypted data by communicating with the corresponding data providers. Note that data analysts can never obtain the plaintext of any IoT data on blockchain. In order to perform training tasks based on encrypted data, we construct secure protocols for two crucial operations in SVM training, i.e., secure polynomial multiplication and secure comparison. Using these building blocks, we propose a privacy-preserving SVM training algorithm, secureSVM, without the need for a trusted third party. Since the training process is based on Paillier, secureSVM is able to train SVM classifiers without loss of accuracy.

We utilize two commonly-used security definitions as our security goals: secure two-party computation [21] and modular sequential composition [22]. We demonstrate that with secureSVM, each data provider is unable to learn any knowledge about the data of other data providers, while the data analyst's model parameters are also kept secret from data providers during the training process.

III. PRELIMINARIES

A. Notation

A dataset D is an unordered set of m records with size |D|, where xi is the i-th record in D and yi is the label corresponding to xi. Define w and b as the two relevant parameters of SVM. ∇t is the descent gradient in the iterative execution of the SVM training algorithm, and λ is the learning rate. In this paper, we use a partially homomorphic cryptosystem named Paillier as the cryptosystem, and let [[m]] represent the encryption of m under Paillier. Table I summarizes the notations used in the following sections.

TABLE I
NOTATIONS

Notation   Explanation
D          Dataset
d          Dataset dimension
xi         i-th record in the dataset
yi         Class label
∇t         Gradient
λ          Learning rate
m          Size of dataset D
w, b       The model parameters
[[m]]      The encryption of m under Paillier
φ(N)       The Euler phi-function

B. Homomorphic Cryptosystem

A cryptosystem consists of three algorithms: key generation (KeyGen), encryption (Enc), and decryption (Dec). A pair of keys (PK, SK) is adopted in public-key cryptosystems, i.e., the public key PK for encryption and the private key SK for decryption.

A cryptosystem is homomorphic if it is capable of mapping operations over ciphertexts to the corresponding plaintexts, with no need for knowledge of the decryption key. Formally, we define the homomorphic property of a cryptosystem in Definition 1.

Definition 1 (homomorphic) [10]. A public-key encryption scheme (Gen, Enc, Dec) is homomorphic if for all n and all (PK, SK) output by Gen(1^n), it is possible to define groups M, C (depending on PK only) such that:
(i) The message space is M, and all ciphertexts output by Encpk are elements of C.
(ii) For any m1, m2 ∈ M, any c1 output by
Encpk(m1), and any c2 output by Encpk(m2), it holds that Decsk(o(c1, c2)) = σ(m1, m2).

We apply Paillier, a partially homomorphic cryptosystem, in our scheme. Paillier is a public-key cryptography scheme with a homomorphic property that allows two important operations, secure addition and secure subtraction. Paillier is based on an assumption related to the hardness of factoring. Let p and q be n-bit primes and N = pq. The public key is N, and the private key is (N, φ(N))^1 in Paillier. The encryption function in Paillier is c := (1 + N)^m · r^N mod N^2, where m ∈ Z_N. The decryption function in Paillier is m := [(c^φ(N) mod N^2 − 1) / N] · φ(N)^{-1} mod N. For more details about Paillier, we refer the reader to [10].

^1 Let N > 1 be an integer. Then Z*_N is an abelian group under multiplication modulo N. Define φ(N) := |Z*_N|, the order of the group Z*_N.

C. Support Vector Machine

SVM is a supervised learning model which gives the maximum-margin hyperplane that classifies the test data [23]. The hyperplane is expressed as y = w^T x + b, with (xi, yi) ∈ D satisfying w^T xi + b ≥ 1 if yi = +1, and w^T xi + b ≤ −1 if yi = −1. The optimization problem of the primal form of SVM is as follows:

    min_{w,b} (1/2) ||w||^2
    s.t. yi (w^T xi + b) ≥ 1, i = 1, 2, ..., m.        (1)

D. Blockchain Systems

Blockchain is an open and distributed ledger taking the form of a list of blocks, originally designed for recording transactions in cryptocurrency systems, e.g., Bitcoin. It enables reliable transactions among a group of untrusted participants. In recent years, many different kinds of blockchain platforms, such as HyperLedger, Ethereum, and EOS, have been proposed and applied to a variety of application scenarios. According to the access restrictions on blockchain users, blockchain platforms can be roughly classified into three categories, namely public blockchains, private blockchains, and consortium blockchains.

Blockchain has several desirable features, making it inherently suitable for reliable data sharing:
• Decentralized. As a distributed ledger, blockchain is built on a peer-to-peer network, with no need for a trusted third party or a central administrator. Multiple replicas of the data recorded in the ledger exist in the system, avoiding data loss in case of a single point of failure.
• Tamper-proof. Blockchain employs consensus protocols, such as Proof-of-Work (PoW), to manage the right of creating new blocks. Thus, data manipulation is usually impractical in terms of computational overhead, making the data recorded in blocks unalterable.
• Traceability. The transactions between two parties in a blockchain system can be easily verified by the rest of the participants. Any transaction can be tracked, and the data owner can benefit in real time, e.g., getting paid for every bit of the data that is used by a third party.

Though blockchain has several advantages over other systems, it is far from perfect when serving as a data sharing platform, due to the vulnerability of data privacy to potential attacks. Originally, all transactions are recorded in blocks in the form of plaintexts, making sensitive information in transactions exposed to all participants, including adversaries [24]. Therefore, security and privacy concerns should be carefully addressed when employing blockchain as a data sharing platform.

IV. PROBLEM DESCRIPTION

In this section, we describe the problem of secure SVM training over encrypted data gathered from multiple parties, including the system model, threat model, and design goals.

A. System Model

We envision a data-driven IoT ecosystem, shown in Figure 1, including IoT devices, IoT data providers, a blockchain-based IoT platform, and IoT data analysts.
• IoT devices are capable of sensing and transmitting valuable data through wireless or wired networks, such as ZigBee, 3G/4G, and WiFi. The data can cover a wide range of real-world applications in smart cities, from environmental conditions to physiological information. Note that IoT devices will not participate in the data sharing and analysis processes due to their limited computational capabilities.
• IoT data providers collect all pieces of data from the IoT devices within their own domains. As a valuable asset of data providers, IoT data usually contains sensitive information. Thus, each data provider encrypts its IoT data using partially homomorphic encryption and then records the data on blockchain.
• The blockchain-based IoT platform serves as a distributed database, where the encrypted IoT data gathered from all data providers is recorded in a shared ledger. Through the built-in consensus mechanism, we can ensure that IoT data is shared in a secure and tamper-proof way.
• IoT data analysts aim at gaining deep insights into the IoT data recorded in the blockchain-based platform by taking full advantage of emerging analysis techniques. Data analysts should communicate with the corresponding data providers to obtain parameters when training SVM classifiers.

B. Threat Model

According to the system model described in Figure 1, there exist multiple potential threats to each kind of entity as well as to their interactions. Since our efforts are dedicated to designing a privacy-preserving scheme for training SVM models over multiple IoT providers, we only consider the threats to data privacy during the interaction between data providers and data analysts.

We consider the data analyst as an honest-but-curious adversary. That is, the data analyst would honestly follow the predesigned ML training protocols, but it may be curious about the private information of data providers.
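To make the Paillier operations from Section III concrete, the following toy sketch (our own illustration, with tiny, insecure parameters and variable names of our choosing) implements the encryption and decryption functions above and checks the additive, subtractive, and linear-combination properties that underpin the secure polynomial multiplication of Eq. (4):

```python
from math import gcd
import random

# Toy Paillier with the simplified keys used in this paper: public key N = p*q,
# private key phi(N). The primes are tiny and NOT secure; for illustration only.
p, q = 293, 433
N = p * q
N2 = N * N
phi = (p - 1) * (q - 1)  # Euler phi-function of N

def enc(m, r=None):
    """c := (1 + N)^m * r^N mod N^2, with random r coprime to N."""
    if r is None:
        r = random.randrange(2, N)
        while gcd(r, N) != 1:
            r = random.randrange(2, N)
    return (pow(1 + N, m, N2) * pow(r, N, N2)) % N2

def dec(c):
    """m := ((c^phi(N) mod N^2 - 1) / N) * phi(N)^{-1} mod N."""
    u = (pow(c, phi, N2) - 1) // N
    return (u * pow(phi, -1, N)) % N

m1, m2 = 123, 45
# Additive property: [[m1 + m2]] = [[m1]] * [[m2]] mod N^2
assert dec(enc(m1) * enc(m2) % N2) == m1 + m2
# Subtraction via the modular multiplicative inverse of [[m2]]
assert dec(enc(m1) * pow(enc(m2), -1, N2) % N2) == m1 - m2
# Linear combination as in Eq. (4): [[a*m1 + b*m2]] = [[m1]]^a * [[m2]]^b mod N^2
a, b = 3, 7
assert dec(pow(enc(m1), a, N2) * pow(enc(m2), b, N2) % N2) == a * m1 + b * m2
```

Note that decryption here uses `pow(phi, -1, N)` (Python 3.8+) for the modular inverse of φ(N); a production implementation would use cryptographically sized primes and a vetted library.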
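For reference, the plaintext counterpart of the training procedure that secureSVM protects, i.e., sub-gradient descent on a hinge-loss relaxation of the primal problem in Eq. (1), can be sketched as follows (a minimal illustration on cleartext data; the toy dataset and hyper-parameters are our own, not taken from the paper's experiments):

```python
# Sub-gradient descent for a linear SVM on plaintext data (illustrative only).
def train_linear_svm(data, lam=0.01, lr=0.1, iters=300):
    """data: list of (x, y) pairs with x a feature list and y in {+1, -1}.
    Minimizes (lam/2)*||w||^2 + (1/m) * sum of hinge losses max{0, 1 - y(w.x + b)}."""
    d = len(data[0][0])
    m = len(data)
    w = [0.0] * d
    b = 0.0
    for _ in range(iters):
        gw = [lam * wj for wj in w]  # gradient of the regularization term
        gb = 0.0
        for x, y in data:
            margin = y * (sum(wj * xj for wj, xj in zip(w, x)) + b)
            if margin < 1:           # indicator I[y (w.x + b) < 1] is active
                for j in range(d):
                    gw[j] -= y * x[j] / m
                gb -= y / m
        w = [wj - lr * gj for wj, gj in zip(w, gw)]
        b -= lr * gb
    return w, b

# Toy linearly separable dataset
data = [([2.0, 2.0], 1), ([3.0, 3.0], 1), ([-2.0, -2.0], -1), ([-3.0, -1.0], -1)]
w, b = train_linear_svm(data)
```

In the encrypted setting described below, the comparison behind the indicator and the polynomial operations in the update step are replaced by the secure comparison and secure polynomial multiplication building blocks.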
C. Design Goals

We allow any two or more IoT data providers to conspire with the IoT data analyst to steal the privacy of other participants. We make the following assumptions: each participant, as an honest-but-curious adversary, performs the protocol honestly but may have interest in the private information of other domains; any two or more participants may collude with each other. As passive adversaries, they do follow the protocol, but try to infer others' privacy as much as possible from the values they learn.

Our scheme aims at protecting the privacy of each participant while training an SVM model securely. To be specific, the privacy of each IoT data provider is its IoT records, and that of the IoT data analyst is the SVM model parameters. We specify our security goals as follows:
1) When facing honest-but-curious adversaries, the privacy of the IoT data analyst and of each IoT data provider remains confidential.
2) When facing any two or more colluding parties, the privacy of the IoT data analyst and of each IoT data provider remains confidential.

V. THE PROPOSED SCHEME

A. System Overview

For simplicity, we assume that a single data analyst aims at training SVM models using the data collected from multiple IoT data providers. The system overview of secureSVM is illustrated in Figure 2. Each IoT data provider preprocesses its IoT data instances, encrypts them locally using its own private keys, and records them in a blockchain-based shared ledger by generating transactions. Existing key management mechanisms [27–29] can be employed to manage the encryption capabilities of data providers.

The data analyst who wants to train an SVM model can get access to the encrypted data recorded in the global ledger, and assemble a secure training algorithm from several building blocks, such as secure comparison and secure polynomial addition. During the training process, interactions between the data analyst and each data provider are necessary for exchanging intermediate results.

B. Encrypted Data Sharing via Blockchain

Now, we describe the data sharing process. To facilitate model training, without loss of generality, we assume that
the data instances for the same training task have been locally preprocessed and represented with the same feature vectors [7].

In order to store the encrypted IoT data in the blockchain, we define a special transaction structure. The transaction format mainly consists of two fields: input and output. The input field includes the data provider's address, the encrypted data, and the IoT device type. The corresponding output field includes the data analyst's address, the encrypted data, and the IoT device type. The addresses in both fields are 32-byte hash values. The encrypted data is derived from the homomorphic encryption, Paillier. The length of each encrypted data instance stored in the blockchain is 128 bytes, based on the assumption that the private key length is 128 bytes. The IoT device type segment has a length of 4 bytes.

After constructing a new transaction, a node representing the data provider in the blockchain network broadcasts it in the P2P network, where the miner nodes can verify the correctness of the transaction. Using existing consensus algorithms, such as the PoW mechanism, a specific miner node is qualified to package the transaction into a new block and add the block to the existing chain. Note that each block may consist of multiple transactions.

C. Building Blocks

As described in Section IV, we should ensure the privacy of multiple IoT providers, and we aim at designing a privacy-preserving scheme for training SVM models over several private datasets provided by several IoT providers. We now introduce the basic building blocks to achieve these goals.

1) Gradient Descent for SVM: Several optimization methods are able to solve for the model parameters of the SVM in Eq. (1). These applicable optimization algorithms include the sequential minimal optimization (SMO) and gradient descent (GD) algorithms. SMO is an optimization approach for the SVM dual quadratic program [30] and performs well for linear SVM and sparse data. However, the calculation steps of SMO are complex due to a large number of comparison, dot product, and division operations [11]. Applying SMO in the encrypted domain may cause prohibitive computation and communication costs. The SVM optimization algorithm based on GD is simple and efficient, involving only a small number of comparisons and vector multiplications. Therefore, we choose GD as the optimization algorithm in the secure SVM training algorithm to optimize the SVM model parameters in Eq. (1).

GD converts the SVM primal into an empirical loss minimization problem with a penalty factor, as shown in Eq. (2):

    min_{w,b} (1/2) ||w||^2 + C Σ_{i=1}^{m} L(w, b, (xi, yi))        (2)

where the right part of the objective is the hinge-loss term, C Σ_{i=1}^{m} L(w, b, (xi, yi)) = C Σ_{i=1}^{m} max{0, 1 − yi (w xi + b)}, and C is the misclassification penalty, which usually takes the value 1/m.

The basic form of GD is x_{n+1} = x_n − λ ∇Grad(x_n). The gradient calculation formula of the SVM is exhibited in Eq. (3), where I[yi (w xi + b) < 1] is the indicator function: if yi (w xi + b) < 1 is true, its value is 1, otherwise it is 0.

    ∇t = λ wt − Σ_{i=1}^{m} I[yi (w xi + b) < 1] · xi yi        (3)

The steps of the SVM model training algorithm using gradient descent are shown in Algorithm 1.

Algorithm 1 Gradient Descent Optimization Algorithm
Input: Training set D = {(x1, y1), (x2, y2), ..., (xm, ym)}, learning rate λ, maxIters T.
Output: w*, b*.
1: while cost > precision and t < T do
2:   Compute ∇t+1 by Eq. (3).
3:   Update wt+1, bt+1 by ∇t+1.
4:   Compute cost by Eq. (2).
5: end while
6: return w*, b*.

2) Secure Polynomial Multiplication: In order to securely train the SVM model, we describe the secure polynomial multiplication used in our secure SVM training algorithm. Using Paillier's homomorphic property, we can obtain secure addition and secure subtraction straightforwardly. The additively homomorphic property of Paillier can be described as [[m1 + m2]] = [[m1]] × [[m2]] mod N^2, and the subtractive homomorphic property can be described as [[m1 − m2]] = [[m1]] × [[m2]]^{-1} mod N^2, where [[m]]^{-1} is the modular multiplicative inverse, which satisfies [[m]] × [[m]]^{-1} mod N^2 = 1 in Paillier. [[m]]^{-1} can be computed by means of the function φ(N), [[m]]^{-1} = [[m]]^{φ(N)−1}. Thereby, the secure polynomial multiplication can be obtained through ciphertext manipulation, as shown in Eq. (4):

    [[a m1 + b m2]] = [[m1]]^a × [[m2]]^b mod N^2        (4)

The security of the secure polynomial multiplication constructed from Paillier depends on Paillier's statistical indistinguishability. Thus, secure polynomial multiplication is statistically indistinguishable, as Paillier is statistically indistinguishable [10].

3) Secure Comparison: The secure comparison in our secure SVM is defined as comparing an encrypted (and unsigned) number [[m]] with the constant 1, as shown in Table II. For parties A and B participating in the secure comparison algorithm, neither party can obtain any information other than the information implied by the input. Our secure comparison protocol is exhibited in Algorithm 2. We analyze its correctness here, and the security proof is detailed in Section VI.

Correctness: |r3 − r2| < r1 ↔ |(r3 − r2)/r1| < 1, and (a·r1 + r2) = (r1 + r3) ↔ (a − 1) = (r3 − r2)/r1. If (a·r1 + r2) > (r1 + r3), then (a − 1) > (r3 − r2)/r1; because a is an integer, we can infer that a > 1, otherwise a ≤ 1.

Proposition 1 (Security of Secure Comparison). Algorithm 2 is secure in the honest-but-curious model.

D. Training Algorithm of SecureSVM

To securely optimize the model parameters, we design a privacy-preserving SVM training algorithm. Suppose there are
2327-4662 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/JIOT.2019.2901840, IEEE Internet of
Things Journal
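To make the homomorphic manipulations of Eq. (4) concrete, the following is a toy, self-contained Paillier implementation in Java (the paper's implementation language). This is an illustrative sketch only: the class name `PaillierDemo` is hypothetical, the key size is far below the 1024-bit setting used in secureSVM, and the API is not the authors' code.

```java
import java.math.BigInteger;
import java.security.SecureRandom;

/** Toy Paillier cryptosystem illustrating the homomorphic manipulations
 *  behind Eq. (4): [[a*m1 + b*m2]] = [[m1]]^a * [[m2]]^b mod N^2.
 *  Hypothetical demo class; parameters are NOT secure. */
public class PaillierDemo {
    static final SecureRandom RNG = new SecureRandom();
    final BigInteger n, n2, lambda, mu;   // public N; private (lambda, mu)

    PaillierDemo(int bits) {
        BigInteger p = BigInteger.probablePrime(bits / 2, RNG);
        BigInteger q = BigInteger.probablePrime(bits / 2, RNG);
        n = p.multiply(q);
        n2 = n.multiply(n);
        BigInteger p1 = p.subtract(BigInteger.ONE), q1 = q.subtract(BigInteger.ONE);
        lambda = p1.multiply(q1).divide(p1.gcd(q1));   // lcm(p-1, q-1)
        mu = lambda.modInverse(n);                     // valid for g = N + 1
    }

    BigInteger enc(BigInteger m) {                     // [[m]] = (N+1)^m * r^N mod N^2
        BigInteger r = new BigInteger(n.bitLength(), RNG)
                .mod(n.subtract(BigInteger.ONE)).add(BigInteger.ONE);
        return n.add(BigInteger.ONE).modPow(m, n2)
                .multiply(r.modPow(n, n2)).mod(n2);
    }

    BigInteger dec(BigInteger c) {                     // L(c^lambda mod N^2) * mu mod N
        BigInteger u = c.modPow(lambda, n2).subtract(BigInteger.ONE).divide(n);
        return u.multiply(mu).mod(n);
    }

    public static void main(String[] args) {
        PaillierDemo pk = new PaillierDemo(512);
        BigInteger m1 = BigInteger.valueOf(15), m2 = BigInteger.valueOf(4);
        // Secure addition: [[m1 + m2]] = [[m1]] * [[m2]] mod N^2
        BigInteger add = pk.enc(m1).multiply(pk.enc(m2)).mod(pk.n2);
        // Secure subtraction: [[m1 - m2]] = [[m1]] * [[m2]]^{-1} mod N^2
        BigInteger sub = pk.enc(m1).multiply(pk.enc(m2).modInverse(pk.n2)).mod(pk.n2);
        // Secure polynomial multiplication, Eq. (4) with a = 3, b = 2
        BigInteger poly = pk.enc(m1).modPow(BigInteger.valueOf(3), pk.n2)
                .multiply(pk.enc(m2).modPow(BigInteger.valueOf(2), pk.n2)).mod(pk.n2);
        System.out.println(pk.dec(add));   // 19
        System.out.println(pk.dec(sub));   // 11
        System.out.println(pk.dec(poly));  // 53 = 3*15 + 2*4
    }
}
```

Note that all three operations are performed purely on ciphertexts; the holder of the secret key only sees the final plaintext results.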
TABLE II
THE CONDITIONS OF SECURE COMPARISON

Input A     Input B     Output A    Output B
[[a]], 1    SK, PK      (a < 1)     -

Algorithm 2 Secure Comparison
Input A: [[a]], 1.
Input B: A pair of keys (PK, SK).
Output B: (a < 1).
1: A uniformly picks three positive integers r1, r2 and r3, where |r3 − r2| < r1.
2: A sends [[a·r1 + r2]] and [[r1 + r3]] to B.
3: B decrypts and compares (a·r1 + r2) with (r1 + r3), and tells the result to A.
4: a > 1 if and only if (a·r1 + r2) > (r1 + r3); otherwise a ≤ 1.
5: return (a < 1)

Algorithm 3 Privacy-Preserving SVM Training Algorithm
P's Input: D = {(x1, y1), (x2, y2), ..., (xm, ym)}.
C's Input: learning rate λ, maxIters T, a pair of keys (PKc, SKc).
C's Output: w*, b*.
1: C initializes w^1, b^1.
2: while cost < precision or t < T do
3:   for i = 1 to n do
4:     C sends [[w^t]] to Pi.
5:     for j = 1 to m do
6:       Pi computes [[y_j(w·x_j − b)]] by secure polynomial multiplication.
7:       Pi compares [[y_j(w·x_j − b)]] with 1 by secure comparison.
8:       Pi updates ∇_i^{t+1} and cost_i^{t+1} and sends them to C.
9:     end for
10:    C decrypts ∇_i^{t+1} and cost_i^{t+1}, and updates w^{t+1}, b^{t+1}.
11:  end for
12: end while
13: return w*, b*.

n data providers P and a single IoT data analyst C. Algorithm 3 specifies the secure SVM training algorithm. In Algorithm 3, the sensitive data of the IoT data providers and the SVM model parameters are kept confidential. Beyond its own legal input, no participant can infer any sensitive information of the other participants from the intermediate results of the algorithm's execution, whether facing honest-but-curious adversaries or any collusion. The security proofs for Algorithm 3 are given in Section VI.

Proposition 2 (Security of Privacy-Preserving SVM Training Algorithm). Algorithm 3 is secure in the honest-but-curious model.

VI. SECURITY ANALYSIS

This section presents the security analysis under the known ciphertext model and the known background model. We adopt two security definitions: secure two-party computation [21] and modular sequential composition [22]. A protocol that satisfies secure two-party computation is secure in the face of honest-but-curious adversaries, and modular sequential composition provides a way to build private protocols in a modular fashion. We present our security proof according to the ideas of these two definitions. For more details, we refer the reader to [21] for secure two-party computation and [22] for modular sequential composition.

We follow the notation in the literature [5]: Let F = (F_A, F_B) be a polynomial function and π a protocol computing F; a is A's input and b is B's input, and A and B desire to compute F(a, b) using π. The view of A is the tuple view_A^π(λ, a, b) = (λ; a; m1, m2, ..., mn), where m1, m2, ..., mn are the messages received by A during the execution. We define the view of B similarly. A's and B's outputs are output_A^π(a, b) and output_B^π(a, b), respectively. The global output of π is output^π(a, b) = (output_A^π(a, b), output_B^π(a, b)).

Definition 2 (Secure Two-Party Computation) [21]. A protocol π privately computes f with statistical security if for all possible inputs (a, b) there exist simulators S_A and S_B such that the following properties hold^2:

    {S_A(a, f_1(a, b)), f(a, b)} ≈ {view_A^π(a, b), output^π(a, b)}
    {f(a, b), S_B(b, f_2(a, b))} ≈ {output^π(a, b), view_B^π(a, b)}

The basic idea of modular sequential composition is the following: the participants run a protocol π that calls an ideal functionality F; e.g., A and B compute F privately by sending their inputs to a trusted third party and receiving the result. If we can prove that the protocol π satisfies secure two-party computation, and a protocol ρ can achieve the same function as F privately, then we can replace the ideal calls to F in π by the protocol ρ; the new protocol π^ρ is then secure under the honest-but-curious model [5, 22].

Theorem 1 (Modular Sequential Composition) [21]. Let F1, F2, ..., Fn be two-party probabilistic polynomial-time functionalities and ρ1, ρ2, ..., ρn protocols that respectively compute F1, F2, ..., Fn in the presence of semi-honest adversaries. Let G be a two-party probabilistic polynomial-time functionality and π a protocol that securely computes G in the (F1, F2, ..., Fn)-hybrid model in the presence of semi-honest adversaries. Then π^{ρ1, ρ2, ..., ρn} securely computes G in the presence of semi-honest adversaries.

A. Security Proof for Secure Comparison

Two roles are involved in Algorithm 2: A and B. The function is F: F([[a]]_B, 1, PK_B, SK_B) = (φ, (a < 1)).

Proof of Proposition 1: The view of A is view_A^π = ([[a]]_B, PK_B). As A does not receive any message from B, her view consists only of her input and the three random numbers she produces. Hence, the simulator S_A^π((a, 1); F(a, 1)) = view_A^π([[a]]_B, 1, PK_B). [[a]]_B is encrypted under PK_B, and the confidentiality of [[a]]_B is equivalent to that of the underlying Paillier cryptosystem. Therefore, A cannot infer the value directly.

^2 ≈ denotes computational indistinguishability against probabilistic polynomial-time adversaries with negligible advantage in the security parameter λ.
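The correctness claim behind step 4 of Algorithm 2 can be sanity-checked with plaintext arithmetic. The sketch below strips away the Paillier layer entirely and uses illustrative values; the class and method names are hypothetical.

```java
/** Plaintext check of the masking in Algorithm 2: with positive integers
 *  r1, r2, r3 chosen by A such that |r3 - r2| < r1, comparing the blinded
 *  values (a*r1 + r2) and (r1 + r3) reveals only whether a > 1, not a
 *  itself. Encryption omitted; values illustrative. */
public class ComparisonCheck {

    // B's step 3: compare the two values decrypted from A's message.
    static boolean maskedGreater(long a, long r1, long r2, long r3) {
        return a * r1 + r2 > r1 + r3;
    }

    public static void main(String[] args) {
        long r1 = 100, r2 = 5, r3 = 50;                    // |50 - 5| < 100 holds
        System.out.println(maskedGreater(3, r1, r2, r3));  // true  -> a > 1
        System.out.println(maskedGreater(1, r1, r2, r3));  // false -> a <= 1
        System.out.println(maskedGreater(0, r1, r2, r3));  // false -> a <= 1
    }
}
```

Since B only ever sees the blinded quantities, repeating the check with different random (r1, r2, r3) yields differently distributed messages for the same a, which is the intuition formalized in the simulator argument.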
The view of B is view_B^π = ((a·r1 + r2), (r1 + r3), PK_B, SK_B). S_B^π runs as follows:
• Generate l random coins and obtain [[(m1, m2, ..., ml)]]_B using PK_B, where l is the length of a.
• Uniformly pick three positive integers c1, c2 and c3, where |c3 − c2| < c1.
• Output ((m·c1 + c2), (c1 + c3), PK_B, SK_B).

The distributions of (a, r1, r2, r3) and (m, c1, c2, c3) are identical, so the real distribution ((a·r1 + r2), (r1 + r3), PK_B, SK_B) and the ideal distribution ((m·c1 + c2), (c1 + c3), PK_B, SK_B) are statistically indistinguishable.

B. Security Proof for Privacy-Preserving SVM Training Algorithm

The roles involved in Algorithm 3 are the n IoT data providers P and an IoT data analyst C. Each IoT data provider behaves in the same way: if we can prove that one of them meets the security requirements, then every data provider meets the security requirements. The function is F: F(D_{Pi}, PK_C, SK_C) = (φ, (w*, b*)).

Proof of Proposition 2: Each IoT data provider's view is view_P^π = (D_{Pi}, [[w^t]]_C, PK_C). [[w^t]]_C is encrypted under PK_C, and the confidentiality of [[w^t]]_C is equivalent to that of the underlying Paillier cryptosystem, so each IoT data provider cannot infer the value directly. Hence, the simulator S_P^π(D_{Pi}; F(a, 1)) = view_P^π(D_{Pi}, [[w^t]]_C, PK_C).

The IoT data analyst's view is view_C^π = (Σ_{i=1}^{m} I|(w·x_i + b) < 1| × x_i y_i, Σ_{i=1}^{m} max{0, 1 − y_i(w·x_i − b)}, w^t, PK_C, SK_C). Now, we need to discuss the confidentiality of Σ_{i=1}^{m} I|(w·x_i + b) < 1| × x_i y_i and Σ_{i=1}^{m} max{0, 1 − y_i(w·x_i − b)}; that is, whether the IoT data analyst can guess the sensitive information of each IoT data provider from these two expressions.

Obviously, both expressions are unsolvable for the unknowns x_i and y_i. Apart from brute-force cracking, there is no better way to recover the real values of the dataset D. Assume that each IoT data provider holds a small dataset of 100 two-dimensional instances, where each dimension is 32 bits^3. Under this circumstance, the probability that the IoT data analyst guesses successfully is 1/2^{n·6400}, which is a negligible probability of success [10].

As Algorithm 2 and the secure polynomial multiplication used in Algorithm 3 are secure in the honest-but-curious model, we obtain the security of Algorithm 3 by modular sequential composition.

^3 Typically, a single-precision floating-point number occupies 4 bytes (32 bits) of memory.

VII. PERFORMANCE EVALUATION

In this section, we evaluate the performance of secureSVM in terms of accuracy and efficiency through extensive experiments using real-world datasets. We first describe the experiment settings and then exhibit the experimental results to demonstrate its effectiveness and efficiency.

TABLE III
STATISTICS OF DATASETS

Datasets    Instances number    Attributes number    Discrete attributes    Numerical attributes
BCWD        699                 9                    0                      9
HDD         294                 13                   13                     0

A. Experiment Setup

Testbed. In our design, each IoT data provider collects all pieces of data from the IoT devices in its own domain and then performs the subsequent operations (e.g., data encryption) on the IoT data. Since IoT providers and the data analyst usually have adequate computing resources, the experiments are run on a PC equipped with a 4-core Intel i7 (i7-3770, 64-bit) processor at 3.40 GHz and 8 GB RAM, serving as the IoT data providers and the IoT data analyst simultaneously. We have implemented the secure polynomial multiplication, the secure comparison, and secureSVM in Java Development Kit 1.8.

Dataset. To evaluate the proposed method, we use two real-world datasets, namely the Breast Cancer Wisconsin Data Set (BCWD) [31] and the Heart Disease Data Set (HDD) [32], both publicly available from the UCI machine learning repository. The features of BCWD are computed from a digitized image of a fine needle aspirate of a breast mass and describe characteristics of the cell nuclei present in the image. Each instance is labeled as benign or malignant. The HDD consists of 13 discrete attributes, and all instances are classified by the type of heart disease. The statistics are shown in Table III. We report the average results of 10 runs of cross-validation to avoid overfitting or contingent results. In each cross-validation, we randomly take 80% of the data to train the model and use the remainder for testing.

Float Format Conversion. General SVM training algorithms operate on floating-point numbers, whereas the operations of the cryptosystem are carried out on integers. In order to encrypt data taking real values, it is necessary to first convert them into an integer representation. A binary floating-point number D is expressed as D = (−1)^s × M × 2^E in the international standard IEEE 754, where s is the sign bit, M is the significand, and E is the exponent. In our implementation of secureSVM, we employ this representation to perform the format conversion.

Key Length Setting. In public-key cryptosystems, the length of the keys is closely related to the security of the cryptosystem, and a short key may lead to insecure encryption. In particular, since the homomorphic operations (i.e., the secure polynomial multiplication) run on ciphertexts, a long key reduces the efficiency of the homomorphic operations, while a too-short key may cause the plaintext space to overflow. Therefore, we must choose the key length to avoid the possibility of overflow. In secureSVM, Paillier's key N is set to 1024 bits.

B. Accuracy

We adopt two commonly used criteria for evaluating ML classifiers. Given a dataset for validation, the precision P is calculated as P = tp/(fp + tp), and the recall R is calculated
as R = tp/(fn + tp), where tp is the number of relevant instances (i.e., the positive class) that are classified correctly, fp is the number of irrelevant instances (i.e., the negative class) that are incorrectly classified as positive, and fn is the number of relevant instances that are classified incorrectly in the test results.

To illustrate that secureSVM does not reduce accuracy while protecting the privacy of each IoT data provider and securely training classifiers, we implemented the general SVM in Java, denoted SVM. Because this paper focuses on how to securely train a classifier, we do not tune the training parameters and use the default parameters. Table IV summarizes the results of precision and recall.

TABLE IV
SUMMARY OF ACCURACY PERFORMANCE

            Precision                Recall
Datasets    secureSVM    SVM        secureSVM    SVM
BCWD        90.35%       90.47%     96.19%       97.24%
HDD         93.89%       93.35%     89.78%       90.87%

Compared with SVM, secureSVM achieves almost the same accuracy, i.e., it does not reduce the accuracy of the classifiers. BCWD is a dataset with all numerical attributes, and HDD is a dataset with all discrete attributes. Our scheme shows good robustness on both types of datasets.

TABLE V
PERFORMANCE OF THE BUILDING BLOCKS IN SECURESVM

Thus, the P time exhibited in Table V is the accumulation of the time spent by the several P. In actual applications, we could let the several P run the algorithms in parallel, so that the time consumption of P and the total time consumption can be decreased sharply. In order to better demonstrate the performance of secureSVM in terms of time consumption, we just show the raw running time. We believe Algorithm 3 to be practical for sensitive applications.

Facing different types of datasets, an all-numerical-attributes dataset (BCWD) and an all-discrete-attributes dataset (HDD), secureSVM shows good robustness in terms of time consumption.

Scalability Evaluation. secureSVM assumes that several IoT data providers participate and provide data. To evaluate the scalability of our scheme, we divide the dataset into several equal parts to simulate scenarios with several IoT data providers, and observe the changes in time consumption as different numbers of IoT data providers participate in the computation. We simulate the cases where the number of IoT data providers increases from 1 to 5. The results are displayed in Figure 3, where the abscissa presents the number of IoT data providers involved in the computation and the ordinate is the time consumption.
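The parallel execution of the providers' computations discussed above can be sketched as follows. This is a plaintext stand-in for the encrypted protocol: each simulated provider computes a hinge-loss subgradient over its own data partition concurrently, and the analyst aggregates the partial results to update w. The class name, data, and parameters are illustrative, the bias term b and the regularization term of Eq. (2) are omitted for brevity, and the real scheme would exchange Paillier ciphertexts rather than plaintext arrays.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

/** Plaintext sketch of running several providers' gradient computations
 *  in parallel, mirroring the structure of Algorithm 3 (hypothetical demo). */
public class ParallelGdSketch {

    static double[] train(double[][] x, int[] y, int providers, int iters,
                          double lr) throws Exception {
        double[] w = new double[x[0].length];
        ExecutorService pool = Executors.newFixedThreadPool(providers);
        int chunk = x.length / providers;
        for (int t = 0; t < iters; t++) {
            final double[] wt = w.clone();
            List<Future<double[]>> partials = new ArrayList<>();
            for (int p = 0; p < providers; p++) {          // one task per provider P_i
                final int lo = p * chunk;
                final int hi = (p == providers - 1) ? x.length : lo + chunk;
                partials.add(pool.submit(() -> {
                    double[] g = new double[wt.length];
                    for (int j = lo; j < hi; j++) {
                        double margin = 0;
                        for (int k = 0; k < wt.length; k++) margin += wt[k] * x[j][k];
                        margin *= y[j];
                        if (margin < 1)                     // hinge-loss subgradient
                            for (int k = 0; k < g.length; k++) g[k] -= y[j] * x[j][k];
                    }
                    return g;
                }));
            }
            double[] grad = new double[w.length];           // the analyst C aggregates
            for (Future<double[]> f : partials)
                for (int k = 0; k < w.length; k++) grad[k] += f.get()[k];
            for (int k = 0; k < w.length; k++) w[k] -= lr * grad[k] / x.length;
        }
        pool.shutdown();
        return w;
    }

    public static void main(String[] args) throws Exception {
        double[][] x = {{2, 1}, {1, 2}, {-1, -1}, {-2, -1}};   // toy separable data
        int[] y = {1, 1, -1, -1};
        double[] w = train(x, y, 2, 200, 0.1);
        for (int j = 0; j < x.length; j++) {
            double margin = y[j] * (w[0] * x[j][0] + w[1] * x[j][1]);
            System.out.println(margin > 0);                 // each point classified correctly
        }
    }
}
```

Because the providers' partial gradients are independent within an iteration, the wall-clock time per iteration is bounded by the slowest provider rather than the sum, which is the effect described for the P time above.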
Meng Shen (M’14) received the B.Eng. degree from Shandong University, Jinan, China, in 2009, and the Ph.D. degree from Tsinghua University, Beijing, China, in 2014, both in computer science. He is currently an associate professor at Beijing Institute of Technology, Beijing, China. His research interests include privacy protection for cloud and IoT, blockchain applications, and encrypted traffic classification. He received the Best Paper Runner-Up Award at IEEE IPCCC 2014. He is a member of the IEEE.
Xiangyun Tang received the B.Eng. degree in computer science from Minzu University of China, Beijing, China, in 2016. She is currently a Ph.D. student in the Department of Computer Science, Beijing Institute of Technology. Her research interests include differential privacy and secure multi-party computation.