07 - Classification Algorithms - Part III
Introduction
Decision Trees
Introduction
There exist many types of nonlinear classifiers; decision trees are one such type.
The figures below show such examples. Trees of this type are known as Ordinary Binary Classification Trees (OBCT). The decision hyperplanes, splitting the space into regions, are parallel to the axes of the space. Other types of partition are also possible, yet less popular.
(Figure: an example decision tree, showing the root, the internal nodes and the leaves.)
Design elements that define a decision tree:
• Each node, $t$, is associated with a subset $X_t \subseteq X$, where $X$ is the training set. At each node, $X_t$ is split into two (binary split) disjoint descendant subsets $X_{t,Y}$ and $X_{t,N}$, where

$$X_{t,Y} \cap X_{t,N} = \emptyset, \qquad X_{t,Y} \cup X_{t,N} = X_t$$
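As a rough illustration of such a binary split, the sketch below (hypothetical function name, NumPy assumed; not from the course material) partitions a node's subset on a single feature and threshold. Every point goes to exactly one descendant, so the two subsets are disjoint and their union is the original subset.

```python
import numpy as np

def binary_split(X_t, y_t, feature, threshold):
    """Split the node subset X_t into its 'Yes' and 'No' descendants.

    A point goes to X_{t,Y} if its value on `feature` is <= threshold,
    otherwise to X_{t,N}; the two subsets are disjoint and cover X_t.
    """
    mask = X_t[:, feature] <= threshold
    return (X_t[mask], y_t[mask]), (X_t[~mask], y_t[~mask])
```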
Splitting Criterion: The main idea behind splitting at each node is that the resulting descendant subsets $X_{t,Y}$ and $X_{t,N}$ should be more class homogeneous than $X_t$. Thus the criterion must be in harmony with such a goal. A commonly used criterion is the node impurity:

$$I(t) = -\sum_{i=1}^{M} P(i \mid t) \log_2 P(i \mid t), \qquad P(i \mid t) = \frac{N_t^i}{N_t}$$

where $N_t^i$ is the number of data points in $X_t$ that belong to class $i$ and $N_t$ is the total number of points in $X_t$. The decrease in node impurity (the expected reduction in entropy, called the information gain) is defined as:

$$\Delta I(t) = I(t) - \frac{N_{t,Y}}{N_t} I(t_Y) - \frac{N_{t,N}}{N_t} I(t_N)$$

where $t_Y$ and $t_N$ are the two descendant nodes and $N_{t,Y}$, $N_{t,N}$ are the numbers of points in $X_{t,Y}$ and $X_{t,N}$.
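A minimal sketch of these two formulas, assuming class labels are given as an array and using base-2 logarithms (function names are illustrative, not from the course material):

```python
import numpy as np

def impurity(labels):
    """Node impurity I(t): entropy of the class distribution at node t."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()            # P(i | t) = N_t^i / N_t
    return -np.sum(p * np.log2(p))

def impurity_decrease(labels, labels_yes, labels_no):
    """Delta I(t): the information gain of splitting t into t_Y and t_N."""
    n = len(labels)
    return (impurity(labels)
            - len(labels_yes) / n * impurity(labels_yes)
            - len(labels_no)  / n * impurity(labels_no))
```

For instance, `impurity_decrease([0, 0, 1, 1], [0, 0], [1, 1])` returns 1.0: a perfect split removes the full bit of entropy present at the parent node.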
• The goal is to choose, at each node, the parameters (feature and threshold) that result in the split with the highest decrease in impurity, i.e. the highest information gain. We want higher information gain, which means lower entropy in the descendant nodes.
• Why the highest decrease? The purer the descendant subsets are, the fewer further splits are needed before the leaves become class homogeneous.
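To find that split, a node can simply scan every feature and every candidate threshold and keep the pair that maximises the gain. A minimal exhaustive-search sketch under those assumptions (numeric features; names are illustrative):

```python
import numpy as np

def entropy(labels):
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X_t, y_t):
    """Scan every (feature, threshold) pair and keep the highest-gain split."""
    best = (None, None, -np.inf)                 # (feature, threshold, gain)
    parent = entropy(y_t)
    for feature in range(X_t.shape[1]):
        for threshold in np.unique(X_t[:, feature]):
            mask = X_t[:, feature] <= threshold
            if mask.all() or not mask.any():     # one side empty: not a real split
                continue
            gain = parent - (mask.mean() * entropy(y_t[mask])
                             + (~mask).mean() * entropy(y_t[~mask]))
            if gain > best[2]:
                best = (feature, threshold, gain)
    return best
```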
Summary of an OBCT algorithmic scheme:
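In outline, the tree is grown recursively: start at the root with the whole training set; at each node find the best (feature, threshold) split; if the node is pure enough, or no split helps, declare it a leaf; otherwise create the two descendants and repeat on each. A minimal recursive sketch of such a scheme (reusing the hypothetical `entropy` and `best_split` helpers from the sketches above; this is an illustration, not the exact scheme from the slides):

```python
import numpy as np
# `entropy` and `best_split` are the hypothetical helpers defined in the sketches above.

def grow_tree(X_t, y_t, min_impurity=1e-6):
    """Grow an OBCT-style tree recursively; the tree is returned as nested dicts."""
    if entropy(y_t) <= min_impurity:                  # node is (almost) pure -> leaf
        return {"leaf": True, "label": y_t[0]}
    feature, threshold, gain = best_split(X_t, y_t)
    if feature is None or gain <= 0:                  # no useful split -> majority leaf
        values, counts = np.unique(y_t, return_counts=True)
        return {"leaf": True, "label": values[counts.argmax()]}
    mask = X_t[:, feature] <= threshold
    return {"leaf": False, "feature": feature, "threshold": threshold,
            "yes": grow_tree(X_t[mask], y_t[mask], min_impurity),
            "no":  grow_tree(X_t[~mask], y_t[~mask], min_impurity)}
```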
Advantages
Example:
Suppose we want to train a decision tree using the following
instances:
Weekend (Examples)   Weather   Parents   Money   Decision (Category)
W1                   Sunny     Yes       Rich    Cinema
W2                   Sunny     No        Rich    Tennis
W3                   Windy     Yes       Rich    Cinema
W4                   Rainy     Yes       Poor    Cinema
W5                   Rainy     No        Rich    Stay in
W6                   Rainy     Yes       Poor    Cinema
W7                   Windy     No        Poor    Cinema
W8                   Windy     No        Rich    Shopping
W9                   Windy     Yes       Rich    Cinema
W10                  Sunny     No        Rich    Tennis
$$\text{Entropy}(S) = -p_{\text{cinema}}\log_2 p_{\text{cinema}} - p_{\text{tennis}}\log_2 p_{\text{tennis}} - p_{\text{shop}}\log_2 p_{\text{shop}} - p_{\text{stay\_in}}\log_2 p_{\text{stay\_in}}$$
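For this dataset the class proportions are 6/10 (cinema), 2/10 (tennis), 1/10 (shopping) and 1/10 (stay in), so Entropy(S) ≈ 1.571 bits. A small check of that arithmetic (illustrative code, not part of the original slides):

```python
from math import log2

# Decisions for W1..W10, read off the table above.
decisions = ["Cinema", "Tennis", "Cinema", "Cinema", "Stay in",
             "Cinema", "Cinema", "Shopping", "Cinema", "Tennis"]

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

print(entropy(decisions))    # ≈ 1.571
```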
Using Entropy(S), we compute the information gain of each candidate attribute, Gain(S, Weather), Gain(S, Parents) and Gain(S, Money), and need to determine the best of these. Weather gives the highest gain, so it becomes the root node, with one branch for each of its values (sunny, windy and rainy).
Now we look at the first branch. Ssunny = {W1, W2, W10}. This is not
empty, so we do not put a default categorisation leaf node here. The
categorisations of W1, W2 and W10 are Cinema, Tennis and Tennis
respectively. As these are not all the same, we cannot put a
categorisation leaf node here. Hence we put an attribute node here,
which we will leave blank for the time being.
Looking at the second branch, Swindy = {W3, W7, W8, W9}. Again, this is not empty, and the examples do not all belong to the same class, so we put an attribute node here, left blank for now. The same situation occurs with the third branch, Srainy = {W4, W5, W6}, hence the amended tree has a blank attribute node at the end of each of the three Weather branches.
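The three Weather branches and the decisions that fall into each can be listed directly from the table; a small illustrative snippet (data hard-coded from the example above):

```python
weekend = {  # example -> (Weather, Parents, Money, Decision)
    "W1": ("Sunny", "Yes", "Rich", "Cinema"),  "W2": ("Sunny", "No", "Rich", "Tennis"),
    "W3": ("Windy", "Yes", "Rich", "Cinema"),  "W4": ("Rainy", "Yes", "Poor", "Cinema"),
    "W5": ("Rainy", "No", "Rich", "Stay in"),  "W6": ("Rainy", "Yes", "Poor", "Cinema"),
    "W7": ("Windy", "No", "Poor", "Cinema"),   "W8": ("Windy", "No", "Rich", "Shopping"),
    "W9": ("Windy", "Yes", "Rich", "Cinema"),  "W10": ("Sunny", "No", "Rich", "Tennis"),
}

for value in ("Sunny", "Windy", "Rainy"):
    branch = {w: row[3] for w, row in weekend.items() if row[0] == value}
    print(value, branch)
# Sunny {'W1': 'Cinema', 'W2': 'Tennis', 'W10': 'Tennis'}
# Windy {'W3': 'Cinema', 'W7': 'Cinema', 'W8': 'Shopping', 'W9': 'Cinema'}
# Rainy {'W4': 'Cinema', 'W5': 'Stay in', 'W6': 'Cinema'}
```

None of the three branches is class homogeneous, which is why each of them receives a further attribute node.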
Within the sunny branch, splitting on Parents gives Syes = {W1} and Sno = {W2, W10}. Notice that Entropy(Syes) and Entropy(Sno) are both zero, because Syes contains examples which are all in the same category (cinema), and Sno similarly contains examples which are all in the same category (tennis). This should make it more obvious why we use information gain to choose attributes to put in nodes.
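This can be checked numerically: since both descendant subsets are pure, the gain of Parents within the sunny branch equals Entropy(Ssunny) itself. An illustrative computation (labels taken from the table):

```python
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((labels.count(c) / n) * log2(labels.count(c) / n)
                for c in set(labels))

s_sunny = ["Cinema", "Tennis", "Tennis"]   # W1, W2, W10
s_yes   = ["Cinema"]                       # Parents = Yes -> W1
s_no    = ["Tennis", "Tennis"]             # Parents = No  -> W2, W10

gain = entropy(s_sunny) - (1/3) * entropy(s_yes) - (2/3) * entropy(s_no)
print(gain)   # ≈ 0.918, since Entropy(s_yes) = Entropy(s_no) = 0
```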
Avoiding Overfitting
References