Decision Tree Classifier On Private Data
Abstract

This paper studies how to build a decision tree classifier under the following scenario: a database is vertically partitioned into two pieces, with one piece owned by Alice and the other piece owned by Bob. Alice and Bob want to build a decision tree classifier based on such a database, but due to privacy constraints, neither of them wants to disclose its private piece to the other party or to any third party. We present a protocol that allows Alice and Bob to conduct such classifier building without having to compromise their privacy. Our protocol uses an untrusted third-party server and is built upon a useful building block, the scalar product protocol. Our solution to the scalar product protocol is more efficient than any existing solution.

Keywords: Privacy, decision tree, classification.

* Portions of this work were supported by Grant ISS-0219560 from the National Science Foundation, and by the Center for Computer Application and Software Engineering (CASE) at Syracuse University.

Copyright © 2002, Australian Computer Society, Inc. This paper appeared at IEEE International Conference on Data Mining Workshop on Privacy, Security, and Data Mining, Maebashi City, Japan. Conferences in Research and Practice in Information Technology, Vol. 14. Chris Clifton and Vladimir Estivill-Castro, Eds. Reproduction for academic, not-for profit purposes permitted provided this text is included.

1 Introduction

Success in many endeavors is no longer the result of an individual toiling in isolation; rather, success is achieved through collaboration, team efforts, or partnerships. In the modern world, collaboration is important because of the mutual benefit it brings to the parties involved. Sometimes such collaboration even occurs between competitors, mutually untrusted parties, or parties that have conflicts of interest, but all parties are aware that the benefit brought by such collaboration will give them an advantage over others. For this kind of collaboration, data privacy becomes extremely important: all parties of the collaboration promise to provide their private data to the collaboration, but none of them wants the others or any third party to learn much about their private data.

Currently, a commonly adopted strategy to solve the above problem is to assume the existence of a trusted third party. In today's dynamic and sometimes malicious environment, making such an assumption can be difficult or in fact not feasible. Solutions that do not assume a trusted third party are greatly desired.

In this paper, we study a very specific collaboration, the data mining collaboration: two parties, Alice and Bob, each having a private data set, want to conduct data mining on the joint data set that is the union of all individual data sets; however, because of the privacy constraints, no party wants to disclose its private data set to the other or to any third party. The objective of this research is to develop efficient methods that enable this type of computation while minimizing the amount of private information each party has to disclose to the other. The solution to this problem can be used in situations where data privacy is a major concern.

Data mining includes various algorithms such as classification, association rule mining, and clustering. In this paper, we focus on classification. There are two types of classification between two collaborative parties: Figure 1.a (suppose each row represents a record in the data set) shows data classification on horizontally partitioned data, and Figure 1.b shows data classification on vertically partitioned data. To use the existing data classification algorithms on horizontally partitioned data, both parties need to exchange some information, but they do not necessarily need to exchange each single record of their data sets. However, for vertically partitioned data, the situation is different. A direct use of the existing data classification algorithms requires one party to send its data (every record) to the other party, or both parties to send their data to a trusted central place (such as a super-computing center) to conduct the computation. In situations where the data records contain private information, such a practice will not be acceptable.

[Figure 1: Data classification on (a) horizontally partitioned data and (b) vertically partitioned data; each party holds a private piece of the data.]

In this paper, we study classification on vertically partitioned data; in particular, we study how to build a decision tree classifier on private (vertically partitioned) data. In this problem, each single record is divided into two pieces, with Alice knowing one piece and Bob knowing the other piece. We have developed a method that allows them to build a decision tree classifier based on their joint data.

2 Related Work

Privacy Preserving Data Mining. In early work on the privacy preserving data mining problem, Lindell and Pinkas (Lindell & Pinkas 2000) propose a solution to the privacy preserving classification problem using the oblivious transfer protocol, a powerful tool developed by secure multi-party computation studies. The solution, however, only deals with horizontally partitioned data, and targets only the ID3 algorithm (because it only emulates the computation of the ID3 algorithm). Another approach for solving the privacy preserving classification problem was proposed by Agrawal and Srikant (Agrawal & Srikant 2000) and also studied in (Agrawal &
Aggarwal 2001). In this approach, each individual data item is perturbed and the distribution of all the data is reconstructed at an aggregate level. The technique works for those data mining algorithms that use probability distributions rather than individual records. An example of a classification algorithm that uses such aggregate information is also discussed in (Agrawal & Srikant 2000).

There has been research considering privacy preservation for other types of data mining. For instance, Vaidya and Clifton proposed a solution (Vaidya & Clifton 2002) to the privacy preserving distributed association mining problem.

Secure Multi-party Computation. The problem we are studying is actually a special case of a more general problem, the Secure Multi-party Computation (SMC) problem. Briefly, an SMC problem deals with computing any function on any input, in a distributed network where each participant holds one of the inputs, while ensuring that no more information is revealed to a participant in the computation than can be inferred from that participant's input and output (Goldwasser 1997). The SMC literature is extensive, having been introduced by Yao (Yao 1982) and expanded by Goldreich, Micali, and Wigderson (Goldreich, Micali & Wigderson 1987) and others (Franklin, Galil & Yung 1992). It has been proved that for any function, there is a secure multi-party computation solution (Goldreich 1998). The approach used is as follows: the function F to be computed is first represented as a combinatorial circuit, and then the parties run a short protocol for every gate in the circuit. Every participant gets corresponding shares of the input wires and the output wires for every gate. This approach, though appealing in its generality and simplicity, means that the size of the protocol depends on the size of the circuit, which depends on the size of the input. This is highly inefficient for large inputs, as in data mining. It has been well accepted that for special cases of computations, special solutions should be developed for efficiency reasons.

3 Decision-Tree Classification Over Private Data

3.1 Background

Classification is an important problem in the field of data mining. In classification, we are given a set of example records, called the training data set, with each record consisting of several attributes. One of the categorical attributes, called the class label, indicates the class to which each record belongs. The objective of classification is to use the training data set to build a model of the class label such that it can be used to classify new data whose class labels are unknown.

Many types of models have been built for classification, such as neural networks, statistical models, genetic models, and decision tree models. Decision tree models are found to be the most useful in the domain of data mining since they obtain reasonable accuracy and are relatively inexpensive to compute. Most decision-tree classifiers (e.g. CART and C4.5) perform classification in two phases: Tree Building and Tree Pruning. In tree building, the decision tree model is built by recursively splitting the training data set based on a locally optimal criterion until all or most of the records belonging to each of the partitions bear the same class label. To improve generalization of a decision tree, tree pruning is used to prune the leaves and branches responsible for the classification of single or very few data vectors.

3.2 Problem Definition

We consider the scenario where two parties, Alice and Bob, each having a private data set (denoted by Sa and Sb respectively), want to collaboratively conduct decision tree classification on the union of their data sets. Because they are concerned about the data's privacy, neither party is willing to disclose its raw data set to the other. Without loss of generality, we make the following assumptions on the data sets Sa and Sb (the assumptions can be achieved by pre-processing the data sets Sa and Sb, and such pre-processing does not require one party to send its data set to the other party):

1. Sa and Sb contain the same number of data records. Let N represent the total number of data records.

2. Sa contains some of the attributes for all records, and Sb contains the other attributes. Let n represent the total number of attributes.

3. Both parties share the class labels of all the records and also the names of all the attributes.

Problem 3.1 (Privacy-preserving Classification Problem) Alice has a private training data set Sa and Bob has a private training data set Sb; data set [Sa ∪ Sb] is the union of Sa and Sb (formed by vertically putting Sa and Sb together so that the concatenation of the ith record in Sa with the ith record in Sb becomes the ith record in [Sa ∪ Sb]). Alice and Bob want to conduct classification on [Sa ∪ Sb] and finally build a decision tree model which can classify new data not from the training data set.
3.3 Tree Building

3.3.1 The Schemes Used to Obtain the Information Gain

There are two main operations during tree building: (1) evaluation of splits for each attribute and selection of the best split, and (2) creation of partitions using the best split. Having determined the overall best split, partitions can be created by a simple application of the splitting criterion to the data. The complexity lies in determining the best split for each attribute.

A splitting index is used to evaluate the "goodness" of the alternative splits for an attribute. Several splitting schemes have been proposed in the past (Rastogi & Shim 2000). We consider two common schemes: the entropy and the gini index. If a data set S contains examples from m classes, Entropy(S) and Gini(S) are defined as follows:

Entropy(S) = − Σ_{j=1}^{m} P_j log P_j    (1)

Gini(S) = 1 − Σ_{j=1}^{m} P_j²    (2)

where P_j is the relative frequency of class j in S. Based on the entropy or the gini index, we can compute the information gain if attribute A is used to partition the data set S:

Gain(S, A) = Entropy(S) − Σ_{v∈A} (|S_v| / |S|) · Entropy(S_v)    (3)

Gain(S, A) = Gini(S) − Σ_{v∈A} (|S_v| / |S|) · Gini(S_v)    (4)

where v represents any possible value of attribute A; S_v is the subset of S for which attribute A has value v; |S_v| is the number of elements in S_v; and |S| is the number of elements in S.

3.3.2 Terminology and Notation

For convenience, we define the following notation:

1. Let Sa represent Alice's data set, and let Sb represent Bob's data set.

2. We say node Â is well classified if Â contains records that all belong to the same class; we say Â is not well classified if Â contains records that do not all belong to the same class.

3. Let B represent an array that contains all of Alice's and Bob's attributes, and let B[i] represent the ith attribute in B.

3.3.3 Decision Tree Building Procedure

The following is the procedure for building a decision tree on (Sa, Sb):

1. Alice computes the information gain for each attribute of Sa; Bob computes the information gain for each attribute of Sb. Initialize the root to be the attribute with the largest information gain.
2. Initialize queue Q to contain the root.
3. While Q is not empty do {
4.   Dequeue the first node Â from Q.
5.   For each attribute B[i] (for i = 1, ..., n), evaluate splits on attribute B[i].
6.   Find the best split among these B[i]'s.
7.   Use the best split to split node Â into Â1, Â2, ..., Âk.
8.   For i = 1, ..., k, add Âi to Q if Âi is not well classified.
9. }

3.3.4 How to Find the Best Split for Attribute Â

For simplicity, we only describe the entropy splitting scheme, but our method is general and can also be used for the gini index and other similar splitting schemes.

To find the best split for each node, we need to find the largest information gain for each node. Let S represent the set of the data belonging to the current node Â. Let R represent the set of requirements that each record in the current node has to satisfy.

To evaluate the information gain for the attribute B[i], if all the attributes involved in R and B[i] belong to the same party, this party can find the information gain for B[i] by itself. However, except for the root node, it is unlikely that R and B[i] belong to the same party; therefore, neither party can compute the information gain by itself. In this case, we face two challenges: one is to compute Entropy(S), and the other is to compute Gain(S, B[i]) for each candidate attribute B[i].

First, let us compute Entropy(S). We divide R into two parts, Ra and Rb, where Ra represents the subset of the requirements that only involve Alice's attributes, and Rb represents the subset of the requirements that only involve Bob's attributes. For instance, in Figure 2, suppose that S contains all the records whose Outlook = Rain and Humidity = High; then Ra = {Outlook = Rain} and Rb = {Humidity = High}.

Let Va be a vector of size N, with Va(i) = 1 if the ith record satisfies Ra and Va(i) = 0 otherwise. Because Ra involves only Alice's attributes, Alice can compute Va from her own share of the attributes. Similarly, let Vb be a vector of size N, with Vb(i) = 1 if the ith record satisfies Rb and Vb(i) = 0 otherwise; Bob can compute Vb from his own share of the attributes. Let Vj be a vector of size N, with Vj(i) = 1 if the ith record belongs to class j and Vj(i) = 0 otherwise. Both Alice and Bob can compute Vj.

Note that a nonzero entry of V = Va ∧ Vb (i.e. V(i) = Va(i) ∧ Vb(i) for i = 1, ..., N) means the corresponding record satisfies both Ra and Rb, and thus belongs to partition S. To build decision trees, we need to find out how many entries in V are nonzero. This is equivalent to computing the scalar product of Va and Vb:

Va · Vb = Σ_{i=1}^{N} Va(i) · Vb(i)    (5)

However, Alice should not disclose her private data Va to Bob, and neither should Bob disclose Vb to Alice. We have developed a Scalar Product protocol (Section 4) that enables Alice and Bob to compute Va · Vb without sharing information with each other.

After getting Va, Vb, and Vj, we can now compute P̂j, where P̂j is the number of occurrences of class j in partition S:

P̂j = Va · (Vb ∧ Vj) = (Va ∧ Vj) · Vb    (6)
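The counting identities in Equations (5) and (6) can be sketched in a few lines. The indicator vectors below are illustrative values (not the paper's Figure 2 data), and the scalar product is computed in the clear here; in the actual protocol it would be evaluated privately by the Scalar Product protocol.

```python
def scalar_product(x, y):
    """Va · Vb as in Equation (5); privately computed in the real protocol."""
    return sum(xi * yi for xi, yi in zip(x, y))

def conjunction(x, y):
    """Component-wise AND of two 0/1 vectors (the ∧ operator)."""
    return [xi & yi for xi, yi in zip(x, y)]

Va   = [0, 0, 1, 1, 1]   # record satisfies Alice's requirements Ra
Vb   = [1, 1, 1, 0, 0]   # record satisfies Bob's requirements Rb
Vyes = [0, 1, 1, 1, 0]   # record belongs to class Yes (known to both parties)

# |S|: number of records satisfying both Ra and Rb.
size_S = scalar_product(Va, Vb)

# P̂_yes = Va · (Vb ∧ Vyes), and the symmetric form (Va ∧ Vyes) · Vb
# from Equation (6) must agree.
p_yes = scalar_product(Va, conjunction(Vb, Vyes))
assert p_yes == scalar_product(conjunction(Va, Vyes), Vb)
```

Either party can apply the ∧ locally (since each knows its own vector and the public Vj), so only one scalar product per class count is needed.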
After knowing P̂j, we can compute |S| = Σ_{j=1}^{m} P̂j and P_j, the relative frequency of class j in S:

P_j = P̂j / |S|    (7)

Therefore, we can compute Entropy(S) using Equation (1). We repeat the above computation for each partition S_v to obtain |S_v| and Entropy(S_v) for all values v of attribute B[i]. We then compute Gain(S, B[i]) using Equation (3).

The above process is repeated until we get the information gain for all attributes B[i] for i = 1, ..., n. Finally, we choose attribute B[k] as the partition attribute for the current node of the tree, where Gain(S, B[k]) = max{Gain(S, B[1]), ..., Gain(S, B[n])}.

We use an example to illustrate how to calculate the information gain using entropy (using the gini index is similar). Figure 2 depicts Alice's and Bob's raw data, and S represents the whole data set.

Alice can compute Gain(S, Outlook):

Entropy(S) = −(3/5) log(3/5) − (2/5) log(2/5) = 0.971

Alice and Bob can exchange these values and select the biggest one as the root. Suppose they choose Outlook as the root node; we will show how to compute Gain(S_Rain, Humidity) in order to build the subtrees, where S_Rain is the set of records whose value of the Outlook attribute is Rain.

Alice computes the vector Va(Rain), where she sets the value to 1 if the value of attribute Outlook is Rain, and to 0 otherwise. She then computes the vector Va(Rain_No), where she sets the value to 1 if the value of attribute Outlook is Rain and the class label is No, and to 0 otherwise. Similarly, she computes the vector Va(Rain_Yes), where she sets the value to 1 if the value of attribute Outlook is Rain and the class label is Yes, and to 0 otherwise. Therefore, Alice obtains:

Va(Rain) = (0, 0, 1, 1, 1)^T
Va(Rain_No) = (0, 0, 0, 0, 1)^T
Va(Rain_Yes) = (0, 0, 1, 1, 0)^T

Bob computes the vector Vb(High), where he sets the value to 1 if the value of attribute Humidity is High, and to 0 otherwise. Similarly, Bob computes the vector Vb(Normal), where he sets the value to 1 if the value of attribute Humidity is Normal, and to 0 otherwise. Therefore, Bob obtains:

Vb(High) = (1, 1, 1, 0, 0)^T
Vb(Normal) = (0, 0, 0, 1, 1)^T

Gain(S_Rain, Humidity) = Entropy(S_Rain)
    − (|S_{v=High}| / |S_Rain|) · Entropy(S_Rain, High)
    − (|S_{v=Normal}| / |S_Rain|) · Entropy(S_Rain, Normal)
    = 0.42

In order to compute Entropy(S_Rain, High), Alice and Bob have to collaborate, because Alice knows S_Rain but knows nothing about whether Humidity is High or not, while Bob knows the Humidity attribute but knows nothing about which records belong to S_Rain. In what follows, P̂_No is the number of occurrences of class No in partition S_Rain when Humidity = High, and P̂_Yes is the number of occurrences of class Yes in partition S_Rain when Humidity = High.
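Parts of the worked example can be reproduced directly from the published vectors. The sketch below assumes logarithms are base 2 (consistent with Entropy(S) = 0.971 for a 3/2 class split) and recomputes only the quantities that do not depend on the raw data in Figure 2, which is not reproduced here.

```python
import math

def entropy(counts):
    """Entropy of a list of class counts (Equation 1), log base 2."""
    total = sum(counts)
    return -sum(c / total * math.log2(c / total) for c in counts if c > 0)

def dot(x, y):
    """Plain scalar product; done privately by the protocol in the paper."""
    return sum(xi * yi for xi, yi in zip(x, y))

# Vectors from the worked example (N = 5 records):
Va_rain     = [0, 0, 1, 1, 1]
Va_rain_no  = [0, 0, 0, 0, 1]
Va_rain_yes = [0, 0, 1, 1, 0]
Vb_high     = [1, 1, 1, 0, 0]

# Entropy(S): the whole data set splits 3/2 between the two classes.
e_S = entropy([3, 2])                    # ≈ 0.971, matching the text

# P̂_No and P̂_Yes for partition S_Rain with Humidity = High, obtained as
# scalar products without either party revealing its vector:
p_no  = dot(Va_rain_no,  Vb_high)
p_yes = dot(Va_rain_yes, Vb_high)
```

With these vectors, only one Rain record has Humidity = High, and it has class Yes, so P̂_No = 0 and P̂_Yes = 1.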
[Figure 2: Alice's and Bob's raw data. Figure 3: the resulting decision tree: root Outlook with branches Sunny and Rain; the Sunny branch leads to a Humidity node with leaves No and Yes, and the Rain branch leads to a Wind node with leaves Yes and No.]
5 Privacy Improvement

5.1 Computing Gini Index

Problem 5.1 (Gini Index) Alice has a sequence a_1, ..., a_m, Bob has a sequence b_1, ..., b_m, and P_j = a_j + b_j for j = 1, ..., m. Without disclosing each party's private inputs to the other, they want to compute

Gini(S) = 1 − Σ_{j=1}^{m} (a_j + b_j)².

Because of the following equation, the gini index can actually be computed using the Scalar Product protocol:

Σ_{j=1}^{m} (a_j + b_j)² = Σ_{j=1}^{m} a_j² + Σ_{j=1}^{m} b_j² + 2 · (a_1, ..., a_m) · (b_1, ..., b_m)

5.2 Computing Entropy

Problem 5.2 (Entropy) Alice has a sequence a_1, ..., a_m, Bob has a sequence b_1, ..., b_m, and P_j = a_j + b_j for j = 1, ..., m. Without disclosing each party's private inputs to the other, they want to compute

Entropy(S) = − Σ_{j=1}^{m} (a_j + b_j) log(a_j + b_j)

To compute the entropy, the difficulty is how to compute log(a_j + b_j). First, let us look at the following Logarithm problem:

Problem 5.3 (Logarithm) Alice has a number a, and Bob has a number b. They want to compute log(a + b), such that Alice (only Alice) gets a′, and Bob (only Bob) gets b′, where b′ is randomly generated and a′ + b′ = log(a + b). Nobody should be able to learn the other party's private input.

To solve this problem, Bob can generate a random positive number r in a finite real domain, and let b′ = − log r. Now Alice and Bob use the scalar product protocol to compute r(a + b), such that only Alice knows r(a + b). Because of r, Alice does not learn b. Then Alice can compute a′ = log(r(a + b)) = log r + log(a + b); therefore a′ + b′ = log(a + b).

Let a′_j + b′_j = log(a_j + b_j), where a′_j and b′_j are generated using the above Logarithm protocol. Now Alice and Bob need to compute the following, which can also be achieved by using the scalar product protocol:

Σ_{j=1}^{m} (a_j + b_j)(a′_j + b′_j)

6 Information Disclosure Analysis

Information disclosure in privacy preserving data mining can come from two sources: one is the disclosure caused by the algorithm, the other is the disclosure caused by the result. The second type is inherent to the privacy preserving data mining problem. We have analyzed the first type of information disclosure when discussing our solution. Here, we discuss the second type of information disclosure. For that, we assume the underlying algorithm itself is perfectly secure and discloses no information.

Assume the decision tree model in Figure 3 is what we get after the decision tree building, and that this model is 100% accurate. Now, given a data item T1 (Alice: (Sunny, No), Bob: (High, Weak, No)), we will show how Bob can figure out Alice's private data.

From the model, the only path that matches Bob's part of the data is the leftmost path: No → High → Humidity → Sunny → Outlook. Therefore, Bob knows that Alice's value of the Outlook attribute is Sunny.

In practice, a decision tree for classification is much more complex, the tree is much deeper, and the accuracy of the model is not perfect, so it is not as easy to guess the other party's private data as in this case. In the next step of our research, we plan to investigate how much information could be disclosed in practice from the final decision tree.

7 Conclusion and Future Work

Classification is an important problem in data mining. Although classification has been studied extensively in the past, the various techniques proposed for classification do not work for situations where the data are vertically partitioned: one piece is known by one party, the other piece by another party, and neither party wants to disclose its private piece to the other. We presented a solution to this problem using a semi-trusted commodity server. We discussed the security of our solution and the possible security problem inherent in the decision tree classification method.

One thing that we plan to investigate is how much information about the vertically partitioned data could be disclosed if both parties know the final decision tree. This is very important for understanding the limits of privacy preserving classification. We will also investigate other decision tree building algorithms and see whether our techniques can be applied to them.
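As a closing illustration, the share algebra of the Logarithm protocol in Section 5.2 can be checked numerically. This sketch abstracts the scalar product sub-protocol into a direct computation (so it verifies only the arithmetic, not the privacy properties), and the inputs a = 5, b = 3 are arbitrary illustrative values.

```python
import math
import random

def logarithm_protocol(a, b):
    """Return additive shares (a_share, b_share) with sum log(a + b)."""
    r = random.uniform(1.0, 100.0)   # Bob's random positive blinding factor
    b_share = -math.log(r)           # Bob keeps b' = -log r
    # Via the scalar product protocol, Alice would learn r*(a+b) but not r or b;
    # here we just compute the masked value directly.
    masked = r * (a + b)
    a_share = math.log(masked)       # Alice keeps a' = log r + log(a+b)
    return a_share, b_share

a_share, b_share = logarithm_protocol(5.0, 3.0)
# a' + b' = (log r + log(a+b)) - log r = log(a+b), with r cancelling out.
```

The blinding factor r hides a + b from Bob and b from Alice, while cancelling exactly in the sum of the shares.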
References
Agrawal, D. & Aggarwal, C. (2001), On the design and quan-
tification of privacy preserving data mining algorithms,
in ‘Proceedings of the 20th ACM SIGACT-SIGMOD-
SIGART Symposium on Principles of Database Systems’,
Santa Barbara, California, USA.
Agrawal, R. & Srikant, R. (2000), Privacy-preserving data min-
ing, in ‘Proceedings of the 2000 ACM SIGMOD on Man-
agement of Data’, Dallas, TX USA, pp. 439–450.
Atallah, M. J. & Du, W. (2001), Secure multi-party computa-
tional geometry, in ‘WADS2001: 7th International Work-
shop on Algorithms and Data Structures’, Providence,
Rhode Island, USA, pp. 165–179.
Beaver, D. (1997), Commodity-based cryptography (extended
abstract), in ‘Proceedings of the 29th Annual ACM Sym-
posium on Theory of Computing’, El Paso, TX USA.
Beaver, D. (1998), Server-assisted cryptography, in ‘Proceed-
ings of the 1998 New Security Paradigms Workshop’,
Charlottesville, VA USA.
Di-Crescenzo, G., Ishai, Y. & Ostrovsky, R. (1998), Univer-
sal service-providers for database private information re-
trieval, in ‘Proceedings of the 17th Annual ACM Sympo-
sium on Principles of Distributed Computing’.
Du, W. (2001), A Study of Several Specific Secure Two-party
Computation Problems, PhD thesis, Purdue University,
West Lafayette, Indiana.
Du, W. & Zhan, Z. (2002), A practical approach to solve se-
cure multi-party computation problems, in ‘Proceedings
of New Security Paradigms Workshop (to appear)’, Virginia Beach, Virginia, USA.
Franklin, M., Galil, Z. & Yung, M. (1992), An overview of se-
cure distributed computing, Technical Report TR CUCS-
00892, Department of Computer Science, Columbia Uni-
versity.
Goldreich, O. (1998), ‘Secure multi-party computation (working draft)’. Available from http://www.wisdom.weizmann.ac.il/home/oded/public_html/foc.html.
Goldreich, O., Micali, S. & Wigderson, A. (1987), How to play
any mental game, in ‘Proceedings of the 19th Annual
ACM Symposium on Theory of Computing’, pp. 218–229.
Goldwasser, S. (1997), Multi-party computations: Past and
present, in ‘Proceedings of the 16th Annual ACM Sym-
posium on Principles of Distributed Computing’, Santa
Barbara, CA USA.
Lindell, Y. & Pinkas, B. (2000), Privacy preserving data min-
ing, in ‘Advances in Cryptology - Crypto2000, Lecture
Notes in Computer Science’, Vol. 1880.
Rastogi, R. & Shim, K. (2000), ‘PUBLIC: A decision tree classifier that integrates building and pruning’, Data Mining
and Knowledge Discovery 4(4), 315–344.
Vaidya, J. & Clifton, C. (2002), Privacy preserving association
rule mining in vertically partitioned data, in ‘Proceedings
of the 8th ACM SIGKDD International Conference on
Knowledge Discovery and Data Mining’.
Yao, A. C. (1982), Protocols for secure computations, in ‘Pro-
ceedings of the 23rd Annual IEEE Symposium on Foun-
dations of Computer Science’.