Distributed Decision Tree Learning in Peer-to-Peer Environments Using A Randomized Approach
3 Related Work
The total number $N$ of distinct random decision trees of depth $k$ that can be constructed over $n$ features is

$$N = \binom{n}{1} \cdot \binom{n-1}{2}\,2! \cdot \binom{n-1-2}{2^2}\,(2^2)! \cdots \binom{n-1-2-\dots-2^{k-1}}{2^k}\,(2^k)! = \prod_{i=0}^{k} \binom{n-2^i+1}{2^i}\,(2^i)! = \frac{n!}{(n-2^{k+1}+1)!}.$$
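As a quick numerical check of this count (a minimal sketch, not part of the original paper; the function names are illustrative), the product form and the telescoped closed form can be compared directly:

```python
from math import comb, factorial

def count_rdts(n: int, k: int) -> int:
    """Product form: level i has 2**i nodes, each testing a distinct,
    previously unused feature (choose 2**i of the remaining features and
    assign them to the nodes in (2**i)! ways)."""
    total, remaining = 1, n
    for i in range(k + 1):
        nodes = 2 ** i
        total *= comb(remaining, nodes) * factorial(nodes)
        remaining -= nodes
    return total

def count_rdts_closed(n: int, k: int) -> int:
    """Closed form n! / (n - 2**(k+1) + 1)! obtained by telescoping."""
    used = 2 ** (k + 1) - 1          # 1 + 2 + ... + 2**k features used in total
    return factorial(n) // factorial(n - used)

# e.g. n = 10 features, depth k = 2: both forms give 604800
assert count_rdts(10, 2) == count_rdts_closed(10, 2) == 604800
```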
Now, the probability that all of the $l$ essential features are chosen in a given path (schema) $S_i$ of $T_i$ is

$$P(S_i) = \frac{\binom{n-l}{k-l}}{\binom{n}{k}}.$$

Hence, the probability that $T_i$ classifies $x$ correctly is

$$P(C_x) = P\left(\bigcup_{i=1}^{2^k} S_i\right) = p \ \text{(say)},$$

which is similar to the "strength" of the individual classifiers as defined by Breiman [3], following which the strength of the ensemble of RDTs can be defined in terms of the margin function as

$$s = E_{x,y}\, mr_{rdt}(x, y),$$

where

$$mr_{rdt}(x, y) = P(T(x) = y) - \max_{j \neq y} P(T(x) = j)$$

and $T \in \{T_1, T_2, \ldots, T_N\}$, the set of RDT classifiers.

Also, since $T_i$ classifies $x$ correctly with probability $p$ and wrongly with probability $q = 1 - p$, we can think of $N$ independent Bernoulli trials (one for each RDT) with probability of success (i.e., of correct classification) equal to $p$.

Now, $P(T(x) = y)$ is the proportion of trees expected to vote correctly, so $P(T(x) = y) = E[I(T(x) = y)] = p$ and $P(T(x) \neq y) = q$. For binary classification we therefore have

$$mr_{rdt}(x, y) = p - q = 2\left(p - \frac{1}{2}\right)$$

$$s = E_{x,y}\, mr_{rdt}(x, y) = 2\left(p - \frac{1}{2}\right).$$

It can easily be seen that the strength $s$ of the individual RDT classifiers will be high in general (when $p > \frac{1}{2}$) and will keep increasing rapidly as the depth $k$ of the generated RDTs increases.

Additionally, we notice that in bagging-like techniques [2] the individual classifiers are constructed from random samples taken with replacement from the population, whereas in RDT construction the entire population is used; hence the strength of the individual classifiers must be higher.

4.2 Accuracy of the ensemble classifier (the population mean)

As explained in the earlier section, for a given test tuple $x$ we can regard the prediction of the target class of $x$ by an RDT classifier as a Bernoulli trial (a coin toss) with success probability (the probability of classifying $x$ correctly) equal to $p$.

Let $Y_i$ denote the random variable giving the outcome of each trial:

$$Y_i = \begin{cases} 1 & x \text{ correctly classified by } T_i \\ 0 & x \text{ wrongly classified by } T_i \end{cases}$$

Then $Y = \sum_{i=1}^{N} Y_i \sim B(N, p)$ follows a Binomial distribution (the population size, i.e., the number of RDTs of depth $k$, is $N$), with population mean $Np$.

The aggregate predictor for the ensemble of RDT classifiers can be computed by majority voting as [2]

$$\phi_A(x) = \arg\max_j Q(j \mid x)$$

or, equivalently, by averaging the individual posterior probabilities given $x$ (over all $N$ RDTs in the population of all possible depth-$k$ binary decision trees), i.e., the population mean [4]

$$\frac{1}{N} \sum_{i=1}^{N} P_i(y \mid x).$$

The aggregate classifier gives the correct output iff a majority ($\lfloor \frac{N}{2} \rfloor + 1$) of the $N$ individual binary classifiers vote for the correct target class label of the test tuple $x$.
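To make the aggregation step concrete, here is a minimal sketch (illustrative only; the array shapes, numbers and names are assumptions, not taken from the paper) of combining the per-tree posterior class distributions by averaging and taking the arg-max:

```python
import numpy as np

def aggregate_predict(posteriors: np.ndarray) -> int:
    """posteriors: shape (N, C) -- one class-probability vector P_i(. | x)
    per random decision tree.  Returns the index of the predicted class."""
    mean_dist = posteriors.mean(axis=0)    # the mean class-distribution vector
    return int(np.argmax(mean_dist))

# Toy example: 5 trees, 2 classes; most trees lean towards class 1.
trees = np.array([[0.2, 0.8],
                  [0.4, 0.6],
                  [0.1, 0.9],
                  [0.7, 0.3],
                  [0.3, 0.7]])
print(aggregate_predict(trees))            # -> 1
```

For binary 0/1 posteriors this averaging coincides with plain majority voting.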
Hence, the aggregate classifier gives the correct output with probability

$$P\left(Y \geq \left\lfloor \tfrac{N}{2} \right\rfloor + 1\right) = 1 - P\left(Y \leq \left\lfloor \tfrac{N}{2} \right\rfloor\right) = 1 - F\left(\left\lfloor \tfrac{N}{2} \right\rfloor;\, N, p\right) \geq 1 - \exp\left(-\frac{1}{2p} \cdot \frac{\left(Np - \lfloor \frac{N}{2} \rfloor\right)^2}{N}\right) = 1 - p_e,$$

where $F(k; N, p) \leq \exp\left(-\frac{1}{2p} \cdot \frac{(Np-k)^2}{N}\right)$ by Chernoff's bound.

Since $p_e$ is likely to be small, we have high accuracy for the aggregate classifier, and hence taking the true "population mean" over all the RDTs explains the high accuracy of the ensemble classifier.

The population mean class distribution vector is obtained by averaging the class distributions of all $N$ possible RDTs of depth $k$: $\mu = \frac{1}{N}\sum_{i=1}^{N} P_i(y \mid x)$. Similarly, the sample mean class distribution is obtained by averaging the class distributions of the $r$ sampled RDTs of depth $k$: $\hat{\mu} = \frac{1}{r}\sum_{i=1}^{r} P_i(y \mid x)$.

Now, since the sample RDTs are chosen independently, by the Central Limit Theorem we have

$$\frac{\mu - \hat{\mu}}{\sigma / \sqrt{r}} \xrightarrow{D} \mathcal{N}(0, 1), \qquad P\left(\left|\frac{\mu - \hat{\mu}}{\sigma/\sqrt{r}}\right| < \lambda\right) = 1 - 2\,\Phi(-\lambda).$$

Also, by Chebyshev's inequality, $\Pr\left(|\mu - \hat{\mu}| \geq \frac{\lambda\sigma}{\sqrt{r}}\right) \leq \frac{1}{\lambda^2}$. These inequalities can be used to find the minimum number $r$ of RDTs that must be generated to ensure that the sample mean does not deviate much from the population mean, as described in the next section.
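The following small numerical check (an illustrative sketch; the values of N and p and the helper names are mine, not the paper's) compares the exact majority-vote success probability of a Binomial(N, p) ensemble with the Chernoff-style lower bound $1 - p_e$ used above:

```python
from math import comb, exp, floor

def majority_prob(N: int, p: float) -> float:
    """Exact P(Y >= floor(N/2) + 1) for Y ~ Binomial(N, p)."""
    t = floor(N / 2) + 1
    return sum(comb(N, y) * p**y * (1 - p)**(N - y) for y in range(t, N + 1))

def chernoff_lower_bound(N: int, p: float) -> float:
    """1 - exp(-(Np - floor(N/2))**2 / (2*p*N)), meaningful when p > 1/2."""
    k = floor(N / 2)
    return 1 - exp(-((N * p - k) ** 2) / (2 * p * N))

p = 0.6                       # assumed strength of one RDT (> 1/2)
for N in (11, 51, 201):
    print(N, round(majority_prob(N, p), 4), round(chernoff_lower_bound(N, p), 4))
# Both the exact probability and the bound approach 1 as N grows.
```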
The error of the ensemble classifier goes to zero as we increase the number of random decision trees, i.e., $P(\hat{f}(x) \neq y) \to 0$ as $r \to \infty$. In other words, if we have the ideal (optimal) decision function $f$, then $\hat{f}$ converges to $f$ a.s., i.e.,

$$\lim_{r \to \infty} P\left(\sum_{i=1}^{r} w_i f_i \to f\right) = 1.$$

Proof:

The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large, which is ensured by the SLLN ([3], Theorem 1.2). Also, the generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them ([3], Theorem 2.3). Since the attributes are chosen independently, the trees are highly uncorrelated and the error is upper-bounded by a small value.

4.4.2 Representing the decision trees: the "sufficient representation" and the "fixed bit" heuristic

Random decision trees can be thought of as functions $\{f_i\}_{i=1}^{r}$ with $f_i : X^n \to D_i$ for all $i$, where $D_i$ is the probability distribution over the class labels. We shall represent our final classifier function $\hat{f}$ as a linear (convex) combination of all the random decision tree functions, i.e., $\hat{f} = \sum_{i=1}^{r} w_i f_i$, with properly chosen weights $w_i$, which can be chosen uniformly (simple averaging), by majority (as in bagging) or by greedy boosting techniques.

Since there are $n$ different features and each can take 3 possible values in a given schema (namely 0, 1 or $*$), along with 2 possible target class labels ($+$ or $-$, since the classifier is binary), we have $3^n \cdot 2$ different concepts. Also, we need at least one name for each concept and no two concepts can share a name, in order to have a "representation" for the class of concepts [7]. Hence we need at least $3^n \cdot 2$ different names in order to have a "representation" of the class of concepts, which is exponential in the number of features.

Let us consider an arbitrary schema "$*1*$" for $k = 3$ and additionally assume that none of the random trees generated possesses a schema (a path) that has $X_2 = 1$ as a fixed bit ($X_2$ was never selected during the construction of any of the random trees, hence all of them have $X_2 = *$). In this case we cannot say anything specific about the target class label depending on the particular value of $X_2$ from our ensemble classifier (this does not reduce accuracy when $X_2$ plays no major role in target classification, but it does in the average case, when the feature $X_2$ contributes to target classification), and we treat this example as an insufficient representation. In other words, we define "sufficient representation" (which has nothing to do with the traditional machine learning notion of representation defined in [7]) by the requirement that for every attribute $X_i$, $i = 1 \ldots n$, the class of $r$ random trees generated must have at least one schema (path) with a "fixed bit" corresponding to that attribute.

Definition 1

A collection of random decision trees $\{f_i : i = 1 \ldots r\}$ provides a "sufficient representation" of the underlying domain iff for each feature $X_i$, $i = 1 \ldots n$, it contains at least one schema with a "fixed bit" corresponding to that feature.

Lemma 2

The minimum number $r_{min}$ of random decision trees required to ensure representation sufficiency to a degree of confidence $\delta$ is given by $r_{min} = \frac{\log\left(\frac{1-\delta}{n}\right)}{\log p}$, where $p = \left(1 - \frac{1}{n}\right)\left(1 - \frac{1}{n-1}\right)^{2} \cdots \left(1 - \frac{1}{n-k+1}\right)^{2^{k-1}}$.

Proof

The probability that a given feature $X_i$ is never selected while building a given random tree is

$$p = \left(1 - \frac{1}{n}\right)\left(1 - \frac{1}{n-1}\right)^{2} \cdots \left(1 - \frac{1}{n-k+1}\right)^{2^{k-1}}.$$

Since the trees are constructed independently, the probability that a given feature $X_i$ is never selected while building any of the $r$ generated random trees is $p^r$, and by the union bound the probability that some feature is never selected is at most $n p^r$. Requiring $n p^r \leq 1 - \delta$, the minimum number of random decision trees that must be generated in order to guarantee confidence $\delta$ is

$$r_{min} = \frac{\log\left(\frac{1-\delta}{n}\right)}{\log p}.$$

For instance, if we want $\delta = 0.9$, then $r_{min} \geq \frac{1 + \log(n)}{\log(1/p)}$, from which we can compute the minimum number of random decision trees $r_{min}$ required to ensure representation sufficiency.
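As a worked example of Lemma 2 (a minimal sketch with illustrative values of n, k and δ, not taken from the paper):

```python
from math import log

def never_selected_prob(n: int, k: int) -> float:
    """p = prod_{j=0}^{k-1} (1 - 1/(n - j))**(2**j): probability that one fixed
    feature is never picked at any of the 2**j nodes on level j of a single tree."""
    p = 1.0
    for j in range(k):
        p *= (1 - 1 / (n - j)) ** (2 ** j)
    return p

def r_min(n: int, k: int, delta: float) -> float:
    """Smallest r with n * p**r <= 1 - delta, i.e. log((1-delta)/n) / log(p)."""
    p = never_selected_prob(n, k)
    return log((1 - delta) / n) / log(p)

# e.g. n = 20 features, depth k = 4, confidence delta = 0.9
print(round(never_selected_prob(20, 4), 4), round(r_min(20, 4, 0.9), 1))
```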
Lemma 3
Each random decision tree of depth $k$ can be represented as $f : \{h_1, h_2, \ldots, h_{2^k}\} \to \{0, 1\}$.

Also, each (partial) schema consists, for a given feature, of one of the 2 fixed values (1, 0) or the variable value ($*$). Hence, saying that any arbitrary decision tree $f$ can be represented by a linear combination of the $r$ random trees essentially means that any arbitrary schema can be represented as a linear combination of the schemata present in the randomly generated trees. We note that a single tree is enough to represent all possible schemata.

Definition 2

The family of functions $\{f_i : i = 1 \ldots r\}$ is functionally complete iff it forms a linearly independent (basis) set of functions in the space of decision tree functions.

Definition 3

The set of functions $\{f_i : i = 1 \ldots r\}$ is linearly independent iff $\forall x \in X^n$, $\sum_{i=1}^{r} \lambda_i f_i(x) = 0 \Rightarrow \lambda_i = 0$ for all $i$.

We shall now be interested in answering the following question: how many random decision trees are needed to achieve functional completeness? We note that by functional completeness we mean that any arbitrary decision tree function $f$ should be expressible as a linear combination of the random decision tree function vectors $(f_i(x),\ i = 1 \ldots r)$ only.

Let $f : \{0, 1, *\}^n \to \{0, 1\}$ be the target class function that we want to represent using a linear combination of the $f_i$. Now, given any tuple $x \in \{0, 1, *\}^n$,

$$P(y = + \mid x) = f(x) = \sum_{i=1}^{r} w_i\, f_i(x) = \sum_{i=1}^{r} w_i\, P(y_i = + \mid x), \quad \forall x \in \{0, 1, *\}^n,\ w_i \in [0, 1].$$

Hence we have $3^n$ equations in $r$ unknown variables (the $w_i$), and we have to find the minimum value of $r$ for which this simultaneous system of linear equations is consistent.

Post's Functional Completeness Theorem

This theorem defines 5 different classes $C_i$ of boolean functions ($\beta$, $\gamma$, alternating, $A : a$, self-dual) and states that a set $F = \{f_i \mid i = 1 \ldots r\}$ of truth functions (here, the set of $r$ RDTs) is functionally complete iff $\forall i \in \{1 \ldots 5\}$, $\exists f \in F$ such that $f \notin C_i$. Hence

$$P(F \text{ is functionally complete}) = P\left(\bigcap_{i=1}^{5} (\exists f \in F \mid f \notin C_i)\right) = \prod_{i=1}^{5} \left(1 - P(\forall f \in F,\ f \in C_i)\right) = \prod_{i=1}^{5} (1 - p_i).$$

Since the $p_i$ are likely to be small, the set of generated random decision trees is likely to be functionally complete.

4.5 Random Decision Trees: Orthogonal?

We can compute the Fourier coefficient $w_j$ of an RDT function $f$ by [6]

$$w_j = \frac{1}{|\wedge|} \sum_{h \in \wedge} f(h)\, \psi_j(h).$$

As shown in the same paper [6], in a depth-$k$ RDT all the Fourier coefficients of order $\geq k$ are zero.

Also, $w_j \neq 0 \Rightarrow \exists$ a schema $h \in \wedge$ in the RDT such that $h$ has a fixed bit in every position where $w_j$ has a 1 bit.

Let $w$ be a Fourier coefficient with order $o(w) = l$, and consider a depth-$k$ binary RDT, which has $2^k$ different paths. Define $I_{P_i}$, $i = 1 \ldots 2^k$, to be the event that the schema represented by path $P_i$ does not have fixed bits in all the positions where $w$ has set bits. Hence,

$$P(I_{P_i}) = 1 - \frac{l}{n} \;\Rightarrow\; P(w \neq 0) = P\left(\bigcap_{i=1}^{2^k} I_{P_i}\right) = \left(1 - \frac{l}{n}\right)^{2^k},$$

which is very small. Hence, most of the Fourier coefficients of the RDTs are likely to be zero, thereby yielding a very low inner product between any two Fourier coefficient vectors, so any two RDTs will be orthogonal to a high degree. Experimental results show the inner product lies between 0.1 and 0.2.
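A small self-contained sketch of this orthogonality claim (illustrative only: the tree construction and function names are mine, every path of the toy tree tests the same k features, and the standard Fourier basis over {0,1}^n is used instead of the schema-based form above). It builds two shallow random trees, computes their Fourier coefficient vectors by brute force, and checks that (i) coefficients of order larger than the depth vanish and (ii) the two coefficient vectors have a small inner product:

```python
import itertools
import random

def random_tree(n: int, k: int, rng: random.Random):
    """A toy depth-k tree: pick k distinct features once (every path tests the
    same features) and give each of the 2**k leaves a random +/-1 label."""
    feats = rng.sample(range(n), k)
    leaves = [rng.choice([-1, 1]) for _ in range(2 ** k)]
    return feats, leaves

def evaluate(tree, x):
    feats, leaves = tree
    idx = 0
    for f in feats:                        # follow the path given by the tested bits
        idx = (idx << 1) | x[f]
    return leaves[idx]

def fourier_coeffs(tree, n: int):
    """Brute-force w_S = E_x[ f(x) * (-1)**(sum of x_i for i in S) ] over {0,1}^n."""
    inputs = list(itertools.product((0, 1), repeat=n))
    subsets = itertools.chain.from_iterable(
        itertools.combinations(range(n), r) for r in range(n + 1))
    return {S: sum(evaluate(tree, x) * (-1) ** sum(x[i] for i in S)
                   for x in inputs) / len(inputs)
            for S in subsets}

rng = random.Random(0)
n, k = 6, 3
t1, t2 = random_tree(n, k, rng), random_tree(n, k, rng)
c1, c2 = fourier_coeffs(t1, n), fourier_coeffs(t2, n)

# (i) every coefficient of order > k is zero, since f depends on only k features
assert all(abs(v) < 1e-9 for S, v in c1.items() if len(S) > k)
# (ii) inner product of the two (unit-norm) coefficient vectors is small
print(sum(c1[S] * c2[S] for S in c1))
```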
4.6 Correlation between the individual RDTs in terms of the Raw Margin Function

4.7 Generalization Error of the Ensemble Classifier

From the analysis in the random forest paper [3], we see that the generalization error is bounded by

$$PE^{*} \leq \bar{\rho}\, \frac{1 - s^2}{s^2},$$

which depends upon:

1. The mean value $\bar{\rho}$ of the correlation coefficient between the features selected for the individual classifiers. Since the RDTs constructed are of small height (shallow) [4], $\bar{\rho}$ will obviously be low, thereby decreasing the upper bound on the generalization error, as shown previously.

2. The strength $s$ of the individual random decision tree classifiers, which is high enough (as per the lower bound proof shown in the earlier section).

As discussed in [3], growing the depth of the individual RDTs increases both the strength and the correlation simultaneously, so there is a tradeoff. The half-full heuristic proposed by Fan et al. gives good results, as explained in [4].

4.8 Training vs. Testing Error

4.9 Learning model with confidence at least δ and error at most ε

5 DRDT: Distributed extension of RDT

5.1 The Algorithm

The distributed extension of the RDT algorithm is straightforward. Instead of using the entire training data (population) for the construction of the individual RDTs, we use random samples (with replacement), as in [2], [3], to construct RDTs at different sites (nodes), one RDT per node. For testing, a test tuple is sent to all the sites, where the probability distribution vector for its class label is computed using the corresponding local RDT. The mean vector of these probability distribution vectors is then computed by distributed average computation, in order to obtain the target class label.

Algorithm 1 DRDT: The Distributed Random Decision Tree Ensemble Classification
1: for each node parallelly do
2:   Randomly (with replacement) select N data tuples from the training data (population).
3:   Locally generate a random decision tree structure of height k using the build RDT technique by Fan et al.
4:   Locally compute the leaf node class distributions for the RDT using the ComputeStatistics technique by Fan et al.
5: end for
6: Given a test tuple x, send x to every node.
7: At each node, parallelly compute the posterior class probability distribution vector locally.
8: Use Distributed Mean to compute the average vector of all the locally computed class distribution vectors, to obtain the target class probability distribution.
9: Output the target class label corresponding to the highest probability in the computed mean vector.
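The paper's Distributed Mean primitive is not spelled out in this extract, so the sketch below (my own single-process illustration, with made-up local distributions) simulates the testing phase of Algorithm 1 and stands in pairwise gossip averaging for step 8:

```python
import numpy as np

def gossip_mean(node_vectors, rounds=200, seed=0):
    """Toy stand-in for the 'Distributed Mean' step: in each round two random
    nodes exchange and average their current vectors, which drives every
    node's vector towards the global mean while preserving the overall sum."""
    rng = np.random.default_rng(seed)
    vecs = [np.asarray(v, dtype=float).copy() for v in node_vectors]
    for _ in range(rounds):
        i, j = rng.choice(len(vecs), size=2, replace=False)
        avg = (vecs[i] + vecs[j]) / 2
        vecs[i], vecs[j] = avg, avg.copy()
    return vecs

# Local class-distribution vectors computed at 4 nodes for a test tuple x (step 7).
local_dists = [np.array([0.2, 0.8]), np.array([0.6, 0.4]),
               np.array([0.1, 0.9]), np.array([0.3, 0.7])]
after = gossip_mean(local_dists)
print(np.round(after[0], 3))      # close to the true mean [0.3, 0.7]
print(int(np.argmax(after[0])))   # step 9: predicted class label -> 1
```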
5.2 Analysis

Similar to the analysis in Section 4, the weights can be found by solving

$$\min_{w}\ \left(P(y = + \mid x) - \sum_{i=1}^{r} w_i\, P(y_i = + \mid x)\right) \quad \text{s.t.}\ x \in \{0, 1, *\}^n,\ w_i \in [0, 1],$$

and hence it makes sense to assign weights to the individual classifiers' predictions, in order to improve the accuracy of the ensemble classifier.
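A rough sketch of fitting such weights (my own illustration with synthetic numbers: projected gradient descent on a least-squares version of the objective, clipping each w_i to [0, 1]):

```python
import numpy as np

def fit_weights(P, target, steps=2000, lr=0.05):
    """P: (num_tuples, r) matrix with P[t, i] = P_i(y=+ | x_t) from tree i.
    target: (num_tuples,) vector of P(y=+ | x_t).  Minimizes the squared
    residual over w while projecting onto the box constraint 0 <= w_i <= 1."""
    w = np.full(P.shape[1], 1.0 / P.shape[1])      # start from uniform averaging
    for _ in range(steps):
        grad = P.T @ (P @ w - target) / len(target)
        w = np.clip(w - lr * grad, 0.0, 1.0)       # projected gradient step
    return w

rng = np.random.default_rng(0)
P = rng.uniform(size=(50, 5))                      # 5 trees, 50 synthetic tuples
true_w = np.array([0.9, 0.1, 0.4, 0.0, 0.6])
target = P @ true_w
print(np.round(fit_weights(P, target), 2))         # recovers weights close to true_w
```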
6.3 The Distributed Algorithm

Algorithm 2 Distributed AdaBoost Technique
1: Horizontally partition the data set $D_{m \times n}$ into $N$ partitions $D_{m_i \times n}$, $D = \bigcup_{i=1}^{N} D_i$, and assign the $i$th partition to node $\aleph_i$ (where $m = \sum_{i=1}^{N} m_i$).
2: At each node $i$, generate the same set of $r$ random decision trees $(T_{i1}, \ldots, T_{ir})$ using the same (pseudo) random generator (with the same seed) and compute the leaf probability distributions locally.
3: Combine the leaf probability distributions (as a distributed sum) of the locally generated trees to obtain the global decision trees $(h_1, \ldots, h_r)$ that are to be present at each of the nodes; the weak classifiers are to be chosen by AdaBoost from these trees.
4: $D_1(i) = \frac{1}{m}$, the initial distribution (assuming each node knows the total number of tuples $m$).
5: for $t = 1$ to $T$ do
6:   At each node $i$, compute the classification error locally by $\epsilon_{ij} = \sum_{k=1}^{m_i} D_t(k)\,[y_k \neq h_j(x_k)]$, $\forall j \in \{1 \ldots r\}$.
7:   Combine the locally computed errors to get the global classification error of each weak classifier $h_j$ (distributed sum): $\epsilon_j = \sum_{i=1}^{N} \epsilon_{ij}$, $\forall j = 1 \ldots r$.
8:   Choose the minimum-error classifier $h_t = \arg\min_{h_j \in H}(\epsilon_j)$ at each node.
9:   Compute $\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)$ at each node.

7 Conclusion

References

[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545–1588, 1997.
[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
[3] L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
[4] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, pages 51–58, 2003.
[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23–37, 1995.
[6] H. Kargupta, B.-H. Park, and H. Dutta. Orthogonal decision trees. IEEE Trans. Knowl. Data Eng., 18(8):1028–1042, 2006.
[7] B. K. Natarajan. Machine Learning: A Theoretical Approach. Springer Netherlands.