
Distributed Decision Tree Learning in Peer-to-Peer Environments

using a Randomized Approach

June 28, 2010

Abstract

Learning decision trees from distributed data has been a challenge, since straightforward distributed implementations of popular centralized decision tree learning algorithms suffer from the high communication cost of selecting attributes at every stage of the tree. This paper offers an alternative communication-efficient randomized technique that is appropriate for large asynchronous peer-to-peer (P2P) networks. It first presents a brief overview of a P2P distributed data mining system called PADMini that can be used for P2P classifier learning, among other tasks. The proposed distributed randomized decision tree learning algorithm is then presented in the context of this system and the related applications. The paper offers a set of analytical results regarding the properties of the randomized decision trees used in this paper, following the construction proposed elsewhere [4]. The paper also presents a detailed description of the distributed algorithm and extensive experimental results.

1 Introduction

Distributed classifier learning from data stored at different locations is an important problem with many applications. Large peer-to-peer networks are creating many such applications where centralized data collection and subsequent learning of classifiers are extremely difficult, if not impossible. Decision trees [?] are popular classifier learning techniques widely used in many applications. Distributed versions of decision tree learning algorithms are therefore of high interest. However, most decision tree learning algorithms work by selecting an attribute in a greedy manner at every stage of the tree construction. This usually results in synchronized construction of the tree by exchanging the information stored at different nodes, which causes high communication overhead and synchronous algorithms that do not scale well in large distributed environments like P2P networks.

This paper motivates the current work by introducing readers to the PADMini system, which offers distributed classifier learning among other capabilities over P2P networks. The paper poses the distributed decision tree learning problem in that context. In order to avoid the root cause of synchronized communication, the paper explores a class of randomized decision trees introduced elsewhere [4] that do not require greedy selection of the attributes at every stage of the tree construction. It investigates analytical properties of those trees and adapts them to design a distributed asynchronous decision tree learning algorithm for P2P networks.

The paper is organized as follows. Section 2 describes the PADMini system in order to motivate the current work. Section 3 reviews the existing literature. Section ?? defines the distributed decision tree learning problem. Section ?? explores various analytical properties of randomized decision trees. Section ?? presents a detailed description of the distributed decision tree learning algorithm. Section ?? offers the experimental results. Finally, Section 7 concludes this paper.
2 Motivation

3 Related Work

A number of "multiple random decision tree approaches" have been developed since the mid-90s. Multiple decision trees are frequently referred to as "ensemble" classifiers, and the different techniques proposed include "bagging" [2], "boosting" [5], and "random forests" [3]. Amit et al. [1] described these methods as "holographic" methods because each data point in the analysis can be resampled many times, under different circumstances, and therefore has the ability to contribute to the shaping of a multi-dimensional view of the data. In bagging, the approach is typically to form different random subsamples of the source data. In boosting, the approach is normally to reweight the contribution of subsampled observations. Observations are reweighted based on how frequently they have been misclassified in previous runs in the multi-tree sequence: observations that have been poorly classified are given more weight than those that have been well classified. On the contrary, random forests are a combination of tree predictors such that each tree depends on the values of a random vector sampled independently and with the same distribution for all trees in the forest [3]. Fan et al. proposed a completely random decision tree algorithm that achieves much higher accuracy than the single best hypothesis, with the advantages being its training efficiency as well as a minimal memory requirement [4]. Figure 1 compares the different techniques proposed for ensemble-based learning. We note that the random decision tree model resembles the random forest model, the only differences between the two being that the forest technique uses random samples to learn the decision trees whereas the one proposed by Fan et al. learns the random decision trees from the entire data (no sampling is done), and that the random forest technique selects the features randomly at once by choosing a random vector, while the random decision tree picks up the features iteratively.

Figure 1: Different ensemble-based classification techniques using random decision trees

4 Analysis: Why an ensemble of purely random decision trees (RDT) gives high accuracy and low generalization error

For the sake of simplicity we present the analysis for decision trees with binary attributes and binary target class labels, but it can easily be extended to the generic case.

4.1 Accuracy (Strength) of the individual RDT classifiers

Let us consider random (binary) decision trees of depth k. The total number of possible such random decision trees is

N = \binom{n}{1} \cdot \binom{n-1}{2} 2! \cdot \binom{n-1-2}{2^2} (2^2)! \cdots \binom{n-1-2-\cdots-2^{k-1}}{2^k} (2^k)!
  = \prod_{i=0}^{k} \binom{n-2^i+1}{2^i} (2^i)! = \frac{n!}{(n-2^{k+1}+1)!}
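The closed form above can be checked numerically. The following Python sketch is purely illustrative: it assumes the product index runs from i = 0 to k exactly as displayed (which requires n >= 2^(k+1) - 1), multiplies out the product, and compares it with n!/(n - 2^(k+1) + 1)!.

from math import comb, factorial

def num_random_trees(n, k):
    # Product form: at level i, choose and order 2**i unused attributes.
    total = 1
    for i in range(k + 1):
        total *= comb(n - 2**i + 1, 2**i) * factorial(2**i)
    return total

# Requires n >= 2**(k + 1) - 1 so that enough attributes remain at every level.
n, k = 20, 3
assert num_random_trees(n, k) == factorial(n) // factorial(n - 2**(k + 1) + 1)
print(num_random_trees(n, k))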

As in [2], an RDT predictor φ(x, T) predicts a class label j ∈ {0, 1} (equivalently, the + and − target class labels). Denote Q(j|x) = P(φ(x, T) = j); the probability that the predictor classifies x correctly is then Σ_j Q(j|x) P(j|x).

Now, let us concentrate on the accuracy of the individual RDT classifiers and define the event

C_x = \bigcup_{j=0}^{1} \big( (\varphi(x, T) = j) \wedge (\text{target class for } x \text{ is } j) \big)

denoting the event that a predictor classifies x correctly, so that P(C_x) = Σ_j P(φ(x, T) = j) P(j|x).

Also, let us define the set of "essential features" F_x as a (nonempty) subset of features needed to classify x correctly. We argue that there always exists such a subset of features (not necessarily unique) for every tuple x such that, in order to predict the target class label of the tuple x correctly, we need all of these l features X_{i_1}, X_{i_2}, ..., X_{i_l} from this subset F_x to be considered by our classifier (i.e., all of them must be present in at least one path in the random decision tree, as fixed bits); this l will vary depending upon the data (explained in figure 2).

Moreover, this set is "irreducibly essential", or minimal, in the sense that if any feature in it is not considered for the classification of x, then x cannot be classified correctly by our classifier. As explained in figure 2, X_2 is such an essential feature for the given dataset.

Figure 2: Essential Features

The probability that an RDT T_i classifies x correctly is P(C_x) = P(F_x ∈ T_i), the probability that all the essential features are present (as a schemata) in at least one path in T_i.
Now, the probability that all of the l essential features are chosen in a given path (schemata) of T_i is

P(S_i) = \frac{\binom{n-l}{k-l}}{\binom{n}{k}}.

Hence, the probability that T_i classifies x correctly is

P(C_x) = P\left( \bigcup_{i=1}^{2^k} S_i \right) = p \ (\text{say}),

which is similar to the "strength" of the individual classifiers defined by Breiman [3], following which the strength of the ensemble of RDTs can be defined in terms of the margin function as

s = E_{x,y}\, mrdt(x, y)

where

mrdt(x, y) = P(T(x) = y) - \max_{j \neq y} P(T(x) = j)

and T ∈ {T_1, T_2, ..., T_N}, the set of RDT classifiers.

Also, since the probability that T_i classifies x correctly is p and wrongly is q = 1 − p, we can think of N independent Bernoulli trials (one for each RDT) with probability of success (i.e., probability of correct classification) p.

Now, P(T(x) = y) is the proportion of trees expected to vote correctly, so P(T(x) = y) = E[I(T(x) = y)] = p and P(T(x) ≠ y) = q. For binary classification we therefore have

mrdt(x, y) = p - q = 2\left(p - \tfrac{1}{2}\right)

and

s = E_{x,y}\, mrdt(x, y) = 2\left(p - \tfrac{1}{2}\right).

It can easily be seen that the strength s of the individual RDT classifiers will be high in general (when we have p > 1/2) and will keep increasing rapidly as the depth k of the generated RDTs increases.

Additionally, we notice that in bagging-like techniques [2] the individual classifiers are constructed from random samples taken with replacement from the population, whereas in RDT construction the entire population is used; hence the strength of the individual classifiers must be higher.
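The per-path probability above and the resulting strength p can be illustrated numerically. The sketch below is only an approximation of the construction counted in Section 4.1 (it assumes that at level i the 2^i node attributes are drawn without replacement from the attributes not used at earlier levels, and that F_x is an arbitrary set of l features, both our own assumptions); it estimates p = P(C_x) by sampling trees and checking whether some root-to-leaf path contains all essential features.

import random
from math import comb

def sample_tree_levels(n, k, rng):
    # Level i holds 2**i attributes, drawn without replacement from the
    # attributes not yet used at earlier levels (a sketch of the construction).
    remaining = list(range(n))
    levels = []
    for i in range(k):
        chosen = rng.sample(remaining, 2 ** i)
        for a in chosen:
            remaining.remove(a)
        levels.append(chosen)
    return levels

def path_has_essentials(levels, leaf, essential):
    # Attributes tested on the root-to-leaf path for a given leaf index.
    k = len(levels)
    on_path = {levels[i][leaf >> (k - i)] for i in range(k)}
    return essential <= on_path

def estimate_strength(n, k, l, trials=20000, seed=0):
    rng = random.Random(seed)
    essential = set(range(l))     # a hypothetical essential set F_x of size l
    hits = 0
    for _ in range(trials):
        levels = sample_tree_levels(n, k, rng)
        if any(path_has_essentials(levels, leaf, essential) for leaf in range(2 ** k)):
            hits += 1
    return hits / trials

n, k, l = 15, 4, 2
per_path = comb(n - l, k - l) / comb(n, k)   # P(S_i) for one fixed path, as in the text
print(per_path, estimate_strength(n, k, l))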
4.2 Accuracy of the ensemble classifier (the population mean)

As explained in the earlier section, for a given test tuple x we can consider the prediction of the target class of x by an RDT classifier as a Bernoulli trial (a coin toss) with success probability (the probability of classifying x correctly) p.

Let Y_i denote the random variable describing the outcome of each trial, with Y_i = 1 if x is correctly classified by T_i and Y_i = 0 if x is wrongly classified by T_i.

Then Y = Σ_{i=1}^{N} Y_i ∼ B(N, p) follows a Binomial distribution (the population size, i.e., the number of RDTs of depth k, is N), with population mean Np.

The aggregate predictor (for the ensemble of RDT classifiers) can be computed by majority voting as [2]

\varphi_A(x) = \arg\max_j Q(j|x)

or, equivalently, by averaging the individual posterior probabilities given x (over all N RDTs in the population of all possible depth-k binary decision trees) as the population mean [4]

\frac{1}{N} \sum_{i=1}^{N} P_i(y|x).

The aggregate classifier gives the correct output iff a majority (⌊N/2⌋ + 1) of the N individual binary classifiers vote for the target class label of the test tuple x correctly.
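As a concrete illustration of the averaging rule above, the short sketch below (the function name is ours, not from the paper) averages the posterior vectors P_i(y|x) returned by the individual trees and outputs the class with the highest mean probability.

import numpy as np

def aggregate_predict(posteriors):
    # posteriors: array of shape (r, num_classes), one P_i(y|x) per RDT.
    mean = posteriors.mean(axis=0)        # averaged class distribution for x
    return int(np.argmax(mean)), mean

label, dist = aggregate_predict(np.array([[0.7, 0.3], [0.4, 0.6], [0.8, 0.2]]))
print(label, dist)                        # -> 0 [0.6333... 0.3666...]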
Hence, the aggregate classifier gives the correct output with probability

P\left(Y \geq \left\lfloor \tfrac{N}{2} \right\rfloor + 1\right) = 1 - P\left(Y \leq \left\lfloor \tfrac{N}{2} \right\rfloor\right) = 1 - F\left(\left\lfloor \tfrac{N}{2} \right\rfloor; N, p\right) \geq 1 - \exp\left(-\frac{1}{2p} \cdot \frac{(Np - \lfloor N/2 \rfloor)^2}{N}\right) = 1 - p_e \ (\text{say}),

where F(k; N, p) ≤ exp(−(1/2p)(Np − k)²/N) by Chernoff's bound.

Since p_e is likely to be small, we have high accuracy for the aggregate classifier, and hence taking the true "population mean" over all the RDTs explains the high accuracy of the ensemble classifier.

4.3 Accuracy of the ensemble classifier (the sample mean)

But the space of all possible RDTs of depth k is exponential, hence only r out of the N possible RDTs of depth k are chosen as individual classifiers.

As explained in [4], the construction of the purely random decision trees can be thought of as random sampling of RDTs from the population of all possible RDTs of size N, and by the Strong Law of Large Numbers (SLLN) we know that the sample aggregate φ̂_A is going to converge to the population aggregate φ_A almost surely.

Also, by applying the Central Limit Theorem (CLT) [4], we can argue that the class label predicted by the sample aggregate ensemble classifier is going to approach the class label predicted by the population aggregate classifier; the latter is able to predict the target class with high accuracy, as we have already shown.

The population mean class distribution vector is obtained by averaging the class distributions of all N possible RDTs of depth k:

\mu = \frac{1}{N} \sum_{i=1}^{N} P_i(y|x).

Similarly, the sample mean class distribution is obtained by averaging the class distributions of the r sampled RDTs of depth k:

\hat{\mu} = \frac{1}{r} \sum_{i=1}^{r} P_i(y|x).

Now, since the sample RDTs are independently chosen, by the Central Limit Theorem we have

\frac{\mu - \hat{\mu}}{\sigma/\sqrt{r}} \xrightarrow{D} \mathcal{N}(0, 1), \qquad P\left( \frac{|\mu - \hat{\mu}|}{\sigma/\sqrt{r}} < \epsilon \right) = 1 - 2\Phi(-\epsilon).

Also, by Chebyshev's inequality, we have Pr(|μ − μ̂| ≥ εσ/√r) ≤ 1/ε². These inequalities can be used to find the minimum number r of RDTs that must be generated to ensure that the sample mean does not deviate much from the population mean, as described in the next section.

4.4 Number of RDTs to be generated

4.4.1 Number (r) of RDTs required for the ensemble classifier to produce at most ε error in classification

As shown earlier, the aggregate classifier classifies a test tuple x correctly with probability at least 1 − p_e, hence it classifies x wrongly with probability at most ε = p_e. If we choose r RDTs, we shall have Y ∼ B(r, p) and

\exp\left(-\frac{1}{2p} \cdot \frac{\left(rp - \lfloor r/2 \rfloor + 1\right)^2}{r}\right) \leq \epsilon.

Hence, given ε, we can solve

\left(rp - \left\lfloor \tfrac{r}{2} \right\rfloor + 1\right)^2 \geq 2pr \log_e\!\left(\tfrac{1}{\epsilon}\right)

for the least number r of RDTs required so that x is classified wrongly with probability at most ε.
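Given p and ε, the smallest r satisfying the condition above can be found by direct search. The sketch below is our own illustration, not part of the paper; it simply scans r upward until the displayed inequality holds and assumes p > 1/2.

from math import floor, log

def min_trees_for_error(p, eps, r_max=100000):
    # Smallest r with (r*p - floor(r/2) + 1)^2 >= 2*p*r*ln(1/eps).
    for r in range(1, r_max + 1):
        if (r * p - floor(r / 2) + 1) ** 2 >= 2 * p * r * log(1 / eps):
            return r
    return None

print(min_trees_for_error(p=0.7, eps=0.01))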
Lemma 1

The error in prediction using the ensemble f̂ decreases as we choose more and more random decision trees, i.e., P(f̂(x) ≠ y) → 0 as r → ∞. In other words, if we have the ideal (optimal) decision function f, then f̂ converges to f a.s., i.e.,

\lim_{r \to \infty} P\left( \sum_{i=1}^{r} w_i f_i \to f \right) = 1.

Proof:

The generalization error for forests converges a.s. to a limit as the number of trees in the forest becomes large, which is ensured by the SLLN ([3], Theorem 1.2). Also, the generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them ([3], Theorem 2.3). Since the attributes are chosen independently, the trees are highly uncorrelated and the error will be upper bounded by a small value.

4.4.2 Representing the decision trees: the "sufficient representation" and the "fixed bit" heuristic

Random decision trees can be thought of as functions {f_i}_{i=1}^{r} with f_i : X^n → D_i for all i, where D_i is the probability distribution of the class labels. We shall represent our final classifier function f̂ as a linear (convex) combination of all random decision tree functions, i.e., f̂ = Σ_{i=1}^{r} w_i f_i, with properly chosen weights w_i, which can be chosen uniformly (simple averaging), by majority (as in bagging), or by greedy boosting techniques.

Since there are n different features and each can have 3 possible values in a given schema (namely 0, 1 or *), along with 2 different possible target class labels (+ or −, since the classifier is binary), we have 3^n · 2 different concepts. Also, we need to have at least one name for each concept and no two concepts can share a name, in order to have a "representation" for the class of concepts [7]. Hence we need at least 3^n · 2 different names in order to have a "representation" of the class of concepts, which is exponential in the number of features.

Let us consider an arbitrary schemata "∗1∗" for k = 3, and additionally assume that none of the random trees generated possesses a schemata (a path) that has X_2 = 1 as a fixed bit (X_2 was never selected during the construction of any of the random trees, hence all of them have X_2 = ∗). In this case, we cannot say anything specific about the target class label depending upon the specific value of X_2 from our ensemble classifier (although this does not reduce the accuracy when X_2 plays no major role in the target classification, it does in the average case, when the feature X_2 contributes to the target classification), and we treat this example as an insufficient representation. In other words, we define "sufficient representation" (this has nothing to do with the traditional machine learning representation defined by [7]) by the notion that for every attribute X_i, i = 1...n, the class of r random trees generated must have at least one schemata (path) having a "fixed bit" corresponding to that attribute.

Definition 1

A collection of random decision trees {f_i : i = 1...r} provides a "sufficient representation" of the underlying domain iff for each feature X_i, i = 1...n, it contains at least one schemata with a "fixed bit" corresponding to that feature.

Lemma 2

The minimum number r_min of random decision trees required to ensure representation sufficiency with confidence δ is given by

r_{min} = \frac{\log\left(\frac{1-\delta}{n}\right)}{\log p}, \quad \text{where } p = \left(1 - \tfrac{1}{n}\right) \left(1 - \tfrac{1}{n-1}\right)^{2} \cdots \left(1 - \tfrac{1}{n-k+1}\right)^{2^{k-1}}.

Proof

The probability that a given feature X_i was never selected while building a given random tree is

p = \left(1 - \tfrac{1}{n}\right) \left(1 - \tfrac{1}{n-1}\right)^{2} \cdots \left(1 - \tfrac{1}{n-k+1}\right)^{2^{k-1}}.

Since the trees are constructed independently, the probability that a given feature X_i was never selected while building any of the r generated random trees is p^r.

Now let us define S_i to be the event that the attribute X_i was selected in at least one of the random decision trees generated (and hence is present as a "fixed bit" in at least one of the schematas). Hence, P(S̄_i) = p^r. Also, by our definition of representation sufficiency, we are interested in lower-bounding the probability that all attributes are present as "fixed bits" in at least one of the schematas of at least one of the random decision trees:

P\left( \bigcap_{i=1}^{n} S_i \right) = 1 - P\left( \bigcup_{i=1}^{n} \bar{S}_i \right) \geq 1 - \sum_{i=1}^{n} P(\bar{S}_i) = 1 - n p^r,

by the union bound.

If we call this our confidence δ, then the above inequality gives us a very straightforward formula for r, the minimum number of random decision trees required to be generated in order to guarantee confidence δ:

r_{min} = \frac{\log\left(\frac{1-\delta}{n}\right)}{\log p}.

For instance, if we want δ = 0.9, then r_min ≥ (1 + log(n)) / log(1/p), from which we can compute the minimum number of random decision trees required to ensure representation sufficiency.
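Lemma 2 is easy to evaluate numerically. The sketch below is our own illustration; it computes p from the product in the proof and then r_min from the union-bound inequality 1 − n·p^r ≥ δ.

from math import log

def p_never_selected(n, k):
    # P(a fixed feature is never picked in one random tree), per Lemma 2.
    p = 1.0
    for i in range(k):
        p *= (1 - 1 / (n - i)) ** (2 ** i)
    return p

def r_min(n, k, delta):
    # Minimum number of trees so that every feature appears as a fixed bit
    # in at least one tree with confidence delta (union-bound argument).
    p = p_never_selected(n, k)
    return log((1 - delta) / n) / log(p)

print(r_min(n=30, k=5, delta=0.9))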

4.4.3 The "diversity" heuristic

The generalization error of a forest of tree classifiers depends on the strength of the individual trees in the forest and the correlation between them ([3], Theorem 2.3). Since the attributes are chosen randomly (two randomly chosen vectors are orthogonal) and independently, the trees are highly uncorrelated and the error will be upper bounded by a small value. This supports the "diversity" heuristic, which has turned out to be very effective: if the random decision trees are diverse in nature, the accuracy increases.

Lemma 3

Assuming that each decision tree is chosen in a totally random manner (e.g., by tossing coins), the expected number of trees that must be generated in order to obtain r different random decision trees (to maintain the diversity heuristic [4]) is Θ(r log(r)), by the randomized Coupon Collector problem.

Proof

Follows directly from the coupon collector problem.
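For intuition, the classical coupon-collector expectation for r equally likely items is r·H_r ≈ r ln r draws; the one-liner below (ours, purely illustrative of the Θ(r log r) claim) evaluates that expectation.

def expected_draws_for_r_distinct(r):
    # Coupon-collector expectation r * H_r, which is Theta(r log r).
    return r * sum(1 / i for i in range(1, r + 1))

print(expected_draws_for_r_distinct(50))   # about 225 draws for 50 distinct trees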
4.4.4 Random Decision Trees and Functional Completeness

We note that any decision tree function is nothing but a set of schematas (or equivalence classes). Since the random decision trees generated are iso-height (all of the same height k < n) and full binary trees of depth k, we have 2^k different schematas (equivalence classes) that represent a single random decision tree (with binary features and binary target class labels). So, equivalently, our desired target classifier can be represented as f : {h_1, h_2, ..., h_{2^k}} → {0, 1}.

Also, each (partial) schemata consists of one of the 2 fixed values (1, 0) or a variable value (∗) for a given feature. Hence, the statement that any arbitrary decision tree f can be represented by a linear combination of the r random trees essentially means that any arbitrary schemata can be represented as a linear combination of the schematas present in the randomly generated trees. We note that a single tree is enough to represent all possible schemata.

Definition 2

The family of functions {f_i : i = 1...r} is functionally complete iff they form a linearly independent (basis) set of functions in the space of decision tree functions.

Definition 3

The set of functions {f_i : i = 1...r} is linearly independent iff ∀x ∈ X^n, Σ_{i=1}^{r} λ_i f_i(x) = 0 ⇒ λ_i = 0 for all i.

We shall now be interested in answering the following question: how many random decision trees are needed to achieve functional completeness? We note that by functional completeness we mean that any arbitrary decision tree function f should be expressible as a linear combination of the random decision tree function vectors (f_i(x), i = 1...r) only.

Let f : {0, 1, ∗}^n → {0, 1} be the target class function that we want to represent using a linear combination of the f_i. Now, given any tuple x ∈ {0, 1, ∗}^n,

P(y = + \mid x) = f(x) = \sum_{i=1}^{r} w_i f_i(x) = \sum_{i=1}^{r} w_i P(y_i = + \mid x), \quad \forall x \in \{0, 1, *\}^n, \; w_i \in [0, 1].

Hence we have 3^n equations with r unknown variables (the w_i's), and we have to find the minimum value of r for which the simultaneous linear system of equations is consistent.

Post's Functional Completeness Theorem

This theorem defines 5 different classes C_i of boolean functions (β, γ, alternating, a, self-dual) and states that a set F = {f_i | i = 1...r} of truth functions (a set of r RDTs) will be functionally complete iff ∀i ∈ {1...5}, ∃f ∈ F such that f ∉ C_i.

P(F \text{ is functionally complete}) = P\left( \bigcap_{i=1}^{5} (\exists f \in F \mid f \notin C_i) \right) = \prod_{i=1}^{5} \left(1 - P(\forall f \in F \mid f \in C_i)\right) = \prod_{i=1}^{5} (1 - p_i)

Since the p_i's are likely to be small, the set of random decision trees generated is likely to be functionally complete.

4.5 Random Decision Trees: Orthogonal?

We can compute the Fourier coefficients w_j of an RDT function f by [6]

w_j = \frac{1}{|\Lambda|} \sum_{h \in \Lambda} f(h) \psi_j(h).

As shown in the same paper [6], in a depth-k RDT all the Fourier coefficients of order ≥ k are zero.

Also, w_j ≠ 0 ⇒ there exists a schema h ∈ Λ in the RDT such that h has a fixed bit in all the positions where w_j has a 1 bit.

Let w be a Fourier coefficient with o(w) = l. Now consider a depth-k binary RDT, which has 2^k different paths. Define I_{P_i}, i = 1...2^k, to be the event that the schemata represented by P_i does not have fixed bits in all the positions where w_j has set bits. Hence,

P(I_{P_i}) = 1 - \frac{l}{n} \;\Rightarrow\; P(w \neq 0) = P\left( \bigcap_{i=1}^{2^k} I_{P_i} \right) = \left(1 - \frac{l}{n}\right)^{2^k},

which is very small.

Hence, most of the Fourier coefficients of the RDTs are likely to be zero, thereby yielding a very low inner product between any two Fourier-coefficient vectors; hence any two RDTs will be orthogonal to a high degree. Experimental results show that the inner product lies between 0.1 and 0.2.
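The sparsity of the spectrum can be seen on a toy example. The sketch below is ours; it uses the standard parity basis ψ_j(x) = (−1)^(j·x) over {0,1}^n rather than the schemata-based transform of [6], and enumerates the Fourier coefficients of a function that, like a shallow tree, depends on a single feature: almost all coefficients vanish.

from itertools import product

def fourier_coefficients(f, n):
    # Coefficients of f: {0,1}^n -> {0,1} w.r.t. the parity basis
    # psi_j(x) = (-1)^(j . x); an illustrative analogue of the transform in [6].
    points = list(product([0, 1], repeat=n))
    coeffs = {}
    for j in points:
        total = sum(f(x) * (-1) ** sum(a * b for a, b in zip(j, x)) for x in points)
        coeffs[j] = total / len(points)
    return coeffs

def depth_one_tree(x):        # a "tree" that only tests feature 0
    return x[0]

nonzero = {j: w for j, w in fourier_coefficients(depth_one_tree, 3).items() if abs(w) > 1e-12}
print(nonzero)                # only the constant and one order-1 coefficient survive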
4.6 Correlation between the individual RDTs in terms of the Raw Margin Function

4.7 Generalization Error of the Ensemble Classifier

From the analysis in the random forest paper [3], we see that the generalization error satisfies

PE^* \leq \bar{\rho} \, \frac{1 - s^2}{s^2},

which depends upon:

1. The mean value ρ̄ of the correlation coefficient between the features selected for the individual classifiers. Since the RDTs constructed are of small height (shallow) [4], ρ̄ will obviously be low, thereby decreasing the upper bound on the generalization error, as shown previously.

2. The strength s of the individual random decision tree classifiers, which is high enough (as per the lower bound shown in the earlier section).

As discussed in [3], since growing the depth of the individual RDT classifiers increases the strength and the correlation simultaneously, there is a tradeoff. The half-full heuristic proposed by Fan et al. gives good results, as explained in [4].
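The bound can be evaluated directly for illustrative values of the two quantities (the numbers below are ours, chosen only for illustration):

def generalization_error_bound(rho_bar, s):
    # Upper bound PE* <= rho_bar * (1 - s^2) / s^2 from [3].
    return rho_bar * (1 - s ** 2) / s ** 2

print(generalization_error_bound(0.15, 0.6))   # about 0.267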
4.8 Training vs. Testing Error

4.9 Learning model with confidence at least δ and error at most ε

5 DRDT: Distributed extension of RDT

5.1 The Algorithm

The distributed extension of the RDT algorithm is fairly straightforward. Instead of using the entire training data (population) for the construction of the individual RDTs, we use random samples (with replacement), as in [2], [3], to construct RDTs at different sites (nodes), one RDT per node. For testing, a test tuple is sent to all the sites, where the probability distribution vector for its class label is computed by the corresponding RDT. The mean vector of the probability distribution vectors is then computed by distributed average computation, in order to obtain the target class label.

Algorithm 1 DRDT: The Distributed Random Decision Tree Ensemble Classification
1: for each node, in parallel, do
2: Randomly (with replacement) select N data tuples from the training data (population).
3: Locally generate a random decision tree structure of height k using the build_RDT technique of Fan et al.
4: Locally compute the leaf node class distributions for the RDT using the ComputeStatistics technique of Fan et al.
5: end for
6: Given a test tuple x, send x to every node.
7: At each node, in parallel, compute the posterior class probability distribution vector locally.
8: Use Distributed Mean to compute the average vector of all the locally computed class distribution vectors, to obtain the target class probability distribution.
9: Output the target class label corresponding to the highest probability in the mean vector computed.
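To make the two building blocks referenced in Algorithm 1 concrete, here is a minimal single-process sketch. The names build_rdt and update_statistics mirror build_RDT and ComputeStatistics, but the exact procedures of Fan et al. [4] may differ; the distributed averaging of step 8 is simulated by an ordinary in-memory mean, and the Node class and toy data are our own illustration.

import random
import numpy as np

class Node:
    def __init__(self, feature=None, left=None, right=None, counts=None):
        self.feature, self.left, self.right, self.counts = feature, left, right, counts

def build_rdt(features, depth, rng):
    # Sketch of build_RDT: pick attributes at random, without looking at any
    # data, until the target depth is reached (binary features, one use per path).
    if depth == 0 or not features:
        return Node(counts=np.zeros(2))          # leaf; two target classes
    f = rng.choice(features)
    rest = [g for g in features if g != f]
    return Node(f, build_rdt(rest, depth - 1, rng), build_rdt(rest, depth - 1, rng))

def update_statistics(node, x, y):
    # Sketch of ComputeStatistics: route each local training tuple to its leaf
    # and accumulate the class counts there.
    if node.feature is None:
        node.counts[y] += 1
        return
    update_statistics(node.left if x[node.feature] == 0 else node.right, x, y)

def leaf_posterior(node, x):
    if node.feature is None:
        total = node.counts.sum()
        return node.counts / total if total else np.full(2, 0.5)
    return leaf_posterior(node.left if x[node.feature] == 0 else node.right, x)

# Each peer would build its own RDT from its local sample; steps 6-9 are
# simulated here by averaging the peers' posterior vectors with np.mean.
rng = random.Random(1)
X = np.array([[0, 1, 0], [1, 1, 0], [0, 0, 1], [1, 0, 1]])
y = np.array([0, 0, 1, 1])
trees = [build_rdt([0, 1, 2], depth=2, rng=rng) for _ in range(5)]
for t in trees:
    for xi, yi in zip(X, y):
        update_statistics(t, xi, yi)
x_test = np.array([1, 0, 1])
mean_posterior = np.mean([leaf_posterior(t, x_test) for t in trees], axis=0)
print(int(np.argmax(mean_posterior)), mean_posterior)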
5.2 Analysis

Similar to the analysis in Section 4.

5.3 Experimental Results

We ran a series of experiments, each time varying a set of parameters affecting the accuracy of our distributed ensemble classifier. We ran our experiments on the astronomy dataset with binary target class labels (EAR and DMS, respectively) and noted the accuracy while changing different parameters.

Figure 3: Change in the Accuracy of the Ensemble Classifier w.r.t. the change in the Size of the Random Sample (used to train the RDT) at each Node

6 Weighted Extension to RDTs: Choosing Weights for Random Decision Trees

So far we have only been concerned with simple averaging of the posterior class probability distributions generated by the RDTs given a test tuple x, i.e., we have assigned equal weights (importance) to the predictions from each of the RDT classifiers. But some of the classifiers may not be as good as others, and hence it makes sense to assign weights to the individual classifiers' predictions, in order to improve the accuracy of the ensemble classifier.

In this section we propose some techniques to assign weights to the decision trees generated, and instead of simple averaging we perform a weighted aggregation to compute the ensemble classifier.

6.1 Choosing Weights by Solving an Optimization Problem

We can find weights for the random decision trees generated and maximize the ensemble classification accuracy via the following quadratic minimization problem:

\min_{w} \left( P(y = + \mid x) - \sum_{i=1}^{r} w_i \, P(y_i = + \mid x) \right)^2
\text{s.t. } x \in \{0, 1, *\}^n, \; w_i \in [0, 1].
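One simple way to approximate such a box-constrained least-squares fit is projected gradient descent over training tuples. The sketch below is ours and makes assumptions the paper leaves open (the objective is summed over a set of training tuples, and P_i(y=+|x) is available as a matrix P); it is not the paper's prescribed solver.

import numpy as np

def fit_weights(P, target, iters=2000, lr=0.05):
    # Minimize sum_x (target[x] - sum_i w_i * P[x, i])^2 subject to 0 <= w_i <= 1.
    r = P.shape[1]
    w = np.full(r, 1.0 / r)                      # start from simple averaging
    for _ in range(iters):
        grad = -2 * P.T @ (target - P @ w)
        w = np.clip(w - lr * grad / len(target), 0.0, 1.0)   # project onto the box
    return w

# toy example: 3 trees, 4 tuples
P = np.array([[0.9, 0.2, 0.6], [0.8, 0.3, 0.5], [0.1, 0.7, 0.4], [0.2, 0.9, 0.5]])
target = np.array([1.0, 1.0, 0.0, 0.0])
print(fit_weights(P, target))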

6.2 AdaBoost: A Boosting Approach

• In each round (t = 1...T) of AdaBoost, a "weak" (baseline) classifier h_t (slightly better than a random guess) is chosen for which the classification error is minimum, i.e., h_t = argmin_{h_j ∈ H} ε_j, where ε_j = Σ_{i=1}^{m} D_t(i)[y_i ≠ h_j(x_i)].

• α_t = (1/2) ln((1 − ε_t)/ε_t) is the weight of h_t.

• D_{t+1}(i) = D_t(i) exp(−α_t y_i h_t(x_i)) / Z_t.

• End if ε_t > 0.5.

• In the end, a "strong" classifier H is learnt as an additive weighted combination of the weak classifiers: H(x) = Σ_{t=1}^{T} α_t h_t(x).

• It can be shown that the classification error (margin) decreases (increases) exponentially in this process of learning the ensemble.
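The loop described by the bullets above can be sketched compactly. The code below is our own illustration with a toy pool of pre-computed weak predictions in {−1, +1} (in the paper's setting the weak learners would be the global RDTs h_1...h_r); it uses α_t = ½ ln((1−ε_t)/ε_t) and stops once the best candidate is no better than chance.

import numpy as np

def adaboost(weak_preds, y, T):
    # weak_preds: (r, m) array of {-1,+1} predictions of r candidate weak
    # classifiers on m training tuples; y: labels in {-1,+1}.
    m = len(y)
    D = np.full(m, 1.0 / m)                       # D_1(i) = 1/m
    chosen, alphas = [], []
    for _ in range(T):
        errs = np.array([D[p != y].sum() for p in weak_preds])
        t = int(np.argmin(errs))                  # h_t = argmin_j eps_j
        eps = errs[t]
        if eps >= 0.5:                            # no weak learner beats chance
            break
        alpha = 0.5 * np.log((1 - eps) / max(eps, 1e-12))
        D = D * np.exp(-alpha * y * weak_preds[t])
        D /= D.sum()                              # normalize by Z_t
        chosen.append(t); alphas.append(alpha)
    def H(preds):                                 # strong classifier H(x)
        return np.sign(sum(a * preds[t] for a, t in zip(alphas, chosen)))
    return H

# toy usage with 3 weak classifiers on 6 tuples
y = np.array([1, 1, 1, -1, -1, -1])
weak = np.array([[1, 1, -1, -1, -1, -1], [1, 1, 1, 1, -1, -1], [-1, 1, 1, -1, -1, 1]])
print(adaboost(weak, y, T=5)(weak))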
6.3 The Distributed Algorithm

Algorithm 2 Distributed AdaBoost Technique
1: Horizontally partition the data set D_{m×n} into N partitions D_{m_i×n}, D = ⋃_{i=1}^{N} D_i, and assign the i-th partition to node ℵ_i (where m = Σ_{i=1}^{N} m_i).
2: At each node i, generate the same set of r random decision trees (T_{i1}, ..., T_{ir}) using the same (pseudo) random generator (with the same seed) and compute the leaf probability distributions locally.
3: Combine the leaf probability distributions (as a distributed sum) of the locally generated trees to obtain the global decision trees (h_1, ..., h_r) that are to be present at every node; the weak classifiers are to be chosen by AdaBoost from these trees.
4: D_1(i) = 1/m, the initial distribution (assuming each node knows the total number of tuples m).
5: for t = 1 to T do
6: At each node i, compute the classification error locally by ε_{ij} = Σ_{k=1}^{m_i} D_t(k)[y_k ≠ h_j(x_k)], ∀j ∈ {1...r}.
7: Combine the local errors (distributed sum) to get the global classification error of each weak classifier h_j: ε_j = Σ_{i=1}^{N} ε_{ij}, ∀j = 1...r.
8: Choose the minimum-error classifier h_t = argmin_{h_j ∈ H} ε_j at each node.
9: Compute α_t = (1/2) ln((1 − ε_t)/ε_t) at each node.
10: If ε_t > 0.5 then break.
11: Compute the new distribution (locally) D_{t+1}(k) = D_t(k) exp(−α_t y_k h_t(x_k)) / Z_t at each node i, and compute the normalization factor Z_t globally.
12: end for
13: At each node compute the strong classifier H(x) = sign( Σ_{t=1}^{T} α_t h_t(x) ).
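Steps 6-7 are where per-round communication happens: each node computes its local weighted errors and a distributed sum yields the same global ε_j everywhere. The sketch below is our own simulation of that step on two in-memory "nodes" (the distributed sum is an ordinary sum over the partitions).

import numpy as np

def local_errors(preds, y, D):
    # preds: (r, m_i) predictions of the r shared trees on this node's tuples.
    return np.array([D[p != y].sum() for p in preds])

# two nodes holding disjoint horizontal partitions of the same 6-tuple data set
y_parts = [np.array([1, 1, 1]), np.array([-1, -1, -1])]
pred_parts = [np.array([[1, 1, -1], [1, -1, -1]]), np.array([[-1, -1, -1], [1, 1, -1]])]
D_parts = [np.full(3, 1 / 6), np.full(3, 1 / 6)]      # D_t restricted to each node

eps = sum(local_errors(p, y, D) for p, y, D in zip(pred_parts, y_parts, D_parts))
print(eps)        # global eps_j for j = 1..r, identical at every node after the sum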
7 Conclusion

References

[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545-1588, 1997.

[2] L. Breiman. Bagging predictors. Machine Learning, 24(2):123-140, 1996.

[3] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.

[4] W. Fan, H. Wang, P. S. Yu, and S. Ma. Is random model better? On its accuracy and efficiency. In ICDM, pages 51-58, 2003.

[5] Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an application to boosting. In EuroCOLT, pages 23-37, 1995.

[6] H. Kargupta, B.-H. Park, and H. Dutta. Orthogonal decision trees. IEEE Trans. Knowl. Data Eng., 18(8):1028-1042, 2006.

[7] B. K. Natarajan. Machine Learning: A Theoretical Approach. Springer Netherlands.