
MACHINE LEARNING
DR. SAEED UR REHMAN
Department of Computer Science,
COMSATS University Islamabad
Wah Campus
Introduction to Machine Learning

CHAPTER 9:
Decision Trees



Outline
Introduction
Univariate Trees
Classification Trees
Regression Trees
Pruning
Rule Extraction from Trees
Learning Rules from Data
Multivariate Trees



Summary Multivariate
Methods
•One inconvenience with multivariate data is that when the number of
dimensions is large, one cannot do a visual analysis.
•There are methods proposed in the statistical literature for displaying
multivariate data.
•Most work on pattern recognition is done assuming multivariate normal
densities.
•Sometimes such a discriminant is even called the Bayes’
optimal classifier, but this is generally wrong; it is only optimal if the
densities are indeed multivariate normal and if we have enough data to
calculate the correct parameters from the data.
•One obvious restriction of multivariate normals is that they do not allow for
data where some features are discrete. A variable with n possible values can
be converted into n dummy 0/1 variables, but this increases dimensionality.



Outline
Introduction
Univariate Trees
Classification Trees
Regression Trees
Pruning
Rule Extraction from Trees
Learning Rules from Data
Multivariate Trees



Decision Trees …Intro
•A decision tree is a hierarchical data structure implementing the
divide-and-conquer strategy.
• It is an efficient nonparametric method, which can be used for
both classification and regression.
•The learning algorithms build the tree from a given labeled training
sample; the tree can then be converted to a set of simple rules that
are easy to understand.



Why Decision Trees?
•In parametric estimation, we define a model over the whole input
space and learn its parameters from all of the training data.
•Then we use the same model and the same parameter set for any test input.
•In nonparametric estimation, we divide the input space into local regions, defined
by a distance measure like the Euclidean norm, and for each input, the
corresponding local model computed from the training data in that region is used.
•Decision trees can handle both categorical and continuous data.
[Figure: a categorical attribute taking discrete values (1, 2, 3, 4) versus a continuous attribute taking values on a numeric scale (0–40)]



Decision Tree… Structure
•A decision tree is a hierarchical model for
supervised learning whereby
the local region is identified in a sequence
of recursive splits in a smaller
number of steps.
•A decision tree is composed of internal
decision nodes and terminal leaves.

•Example (in figure): a dataset and the corresponding decision tree.



Decision Tree… Structure
A decision tree is also a nonparametric model in the sense that we
do not assume any parametric form for the class densities, and the
tree structure is not fixed a priori; rather, the tree grows, with
branches and leaves added during learning, depending on the
complexity of the problem inherent in the data.



Divide and Conquer
•Internal decision nodes
• Univariate: Uses a single attribute, xi
• Numeric xi : Binary split : xi > wm
• Discrete xi : n-way split for n possible values
• Multivariate: Uses all attributes, x
• wm is a suitably chosen threshold value
•Leaves
 • Classification: Class labels, or class proportions
 • Regression: A numeric value; the average of the r values reaching the leaf, or a local fit
•Learning is greedy; find the best split recursively (Breiman et al., 1984; Quinlan, 1986, 1993)



Divide and Conquer
•Each fm(x) defines a discriminant (at a decision node) in the d-dimensional
input space, dividing it into smaller regions that are further subdivided as we
take a path from the root down.
•fm(·) is a simple function, and when written down as a tree, a complex
function is broken down into a series of simple decisions. Different decision
tree methods assume different models for fm(·), and the model class defines
the shape of the discriminant and the shape of the regions.
•Each leaf node has an output label, which in the case of classification is the
class code and in regression is a numeric value. A leaf node defines a localized
region in the input space where instances falling in this region have the same
label (in classification) or very similar numeric outputs (in regression).
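To make the node/leaf structure concrete, here is a minimal illustrative sketch in Python (not the book's code); the class and attribute names are assumptions for illustration only:

class Node:
    """A univariate decision tree node: either an internal split or a leaf."""
    def __init__(self, feature=None, threshold=None, left=None, right=None, output=None):
        self.feature = feature      # index j of the attribute tested at this node
        self.threshold = threshold  # w_m: split point for the numeric attribute
        self.left = left            # branch taken when x[j] <= threshold
        self.right = right          # branch taken when x[j] > threshold
        self.output = output        # class label (or numeric value) if this is a leaf

def predict(node, x):
    # Follow the path of decisions from the root until a leaf is reached.
    while node.output is None:
        node = node.right if x[node.feature] > node.threshold else node.left
    return node.output

# Example: the region x0 > 2 is labeled 'B', everything else 'A'.
tree = Node(feature=0, threshold=2.0,
            left=Node(output='A'), right=Node(output='B'))
print(predict(tree, [3.5]))   # 'B'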



Example[1]



Univariate Trees
•The test in a decision node uses only one input variable.
•If the used input dimension, Xj, is discrete, taking one of n possible values, the decision
node checks the value of Xj and takes the corresponding branch, implementing an n-way
split.

•For example,
•If an attribute is color ∈ {red, blue, green}

A decision node has discrete branches and a numeric input should be


discretized. If xj is numeric (ordered), the test is a comparison

•where w is a suitably chosen threshold value.


m0



Univariate Trees
Binary Split:
•The decision node divides the input space into
two: Lm = {x|xj > wm0} and Rm = {x|xj ≤ wm0};
this is called a binary split.
•Successive decision nodes on a path from the
root to a leaf further divide these into two
using other attributes and generating splits
orthogonal to each other as shown here
•The leaf nodes define hyperrectangles in the
input space



Univariate Trees
• Tree learning algorithms are greedy and, at each step, starting at
the
root with the complete training data, we look for the best split.
• This splits the training data into two or n, depending on whether
the chosen attribute is numeric or discrete.
• We then continue splitting recursively with the corresponding
subset until we do not need to split anymore, at which point a
leaf node is created and labeled.



Classification Trees
• Decision trees used for classification are called classification trees.

•The goodness of a split is quantified by an impurity measure.


•A split is pure if after the split, for all branches, all the instances choosing a branch
belong to the same class.
•Let us say for node m, Nm is the number of training instances reaching node m. For the
root node, it is N. Nim of the Nm instances belong to class Ci, with Σi Nim = Nm.
•Given that an instance reaches node m, the estimate for the probability of class Ci is

    pim = Nim / Nm


Classification Trees
•Node m is pure if pim for all i are either 0 or 1.
•It is 0 when none of the instances reaching node m are of class Ci, and it is 1 if all such
instances are of Ci.
•If the split is pure, we do not need to split any further and can add a leaf node labeled
with the class for which pim is 1.

•One possible function to measure impurity is entropy:

    Im = − Σi pim log2 pim

•where 0 log 0 ≡ 0. Entropy in information theory specifies the minimum
number of bits needed to encode the class code of an instance.
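As a quick illustration (a minimal sketch, not from the book), the entropy impurity of a node can be computed directly from the class counts Nim reaching it:

import math

def node_entropy(counts):
    """Entropy impurity I_m from the class counts [N_m^1, N_m^2, ...] at a node."""
    total = sum(counts)
    entropy = 0.0
    for n in counts:
        if n > 0:                       # 0 log 0 is taken as 0
            p = n / total               # p_m^i = N_m^i / N_m
            entropy -= p * math.log2(p)
    return entropy

print(node_entropy([5, 5]))   # 1.0 bit: maximally impure two-class node
print(node_entropy([10, 0]))  # 0.0: pure node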



Classification Trees
For all attributes, discrete and numeric, and, for a numeric attribute,
for all possible split positions, we calculate the impurity of the split
and choose the one that has the minimum entropy.
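A minimal sketch of this search for a numeric attribute (illustrative code, reusing node_entropy from the sketch above; not the book's code): candidate thresholds are taken midway between consecutive sorted values, and each candidate is scored by the weighted entropy of its two branches.

def best_numeric_split(values, labels):
    """Return (threshold, impurity) minimizing the weighted entropy of a binary split."""
    pairs = sorted(zip(values, labels))
    best = (None, float('inf'))
    for i in range(1, len(pairs)):
        if pairs[i - 1][0] == pairs[i][0]:
            continue                                   # no threshold between equal values
        threshold = (pairs[i - 1][0] + pairs[i][0]) / 2
        left = [y for x, y in pairs if x <= threshold]
        right = [y for x, y in pairs if x > threshold]
        impurity = (len(left) * node_entropy(class_counts(left)) +
                    len(right) * node_entropy(class_counts(right))) / len(pairs)
        if impurity < best[1]:
            best = (threshold, impurity)
    return best

def class_counts(labels):
    return [labels.count(c) for c in set(labels)]

print(best_numeric_split([1, 2, 3, 10, 11, 12], ['A', 'A', 'A', 'B', 'B', 'B']))  # (6.5, 0.0)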



Classification Trees
•Tree construction then continues recursively, and in parallel, for all the
branches that are not pure, until all branches are pure.
•This is the basis of CART and the ID3 algorithm, and its extension C4.5.
•The C4.5 pseudocode of the algorithm is given in the figure on the
next slide.



Classification Trees



Regression Trees
•A regression tree is constructed in almost the same manner as a
classification tree, except that the impurity measure that is
appropriate for classification is replaced by a measure appropriate
for regression.
•Let us say for node m, Xm is the subset of X reaching node m;
namely, it is the set of all x ∈ X satisfying all the conditions in the
decision nodes on the path from the root until node m. We define

    bm(x) = 1 if x ∈ Xm (i.e., x reaches node m), and 0 otherwise


Regression Trees
•Tree learning algorithms are greedy and, at each step,
starting at the root with the complete training data, we look
for the best split.
•This splits the training data into two or n, depending on
whether the chosen attribute is numeric or discrete.
•We then continue splitting recursively with the
corresponding subset until we do not need to split anymore,
at which point a leaf node is created and labeled.



Regression Trees
•In regression, the goodness of a split is measured by the mean square
error from the estimated value. Let us say gm is the estimated value in node m:

    Em = (1/Nm) Σt (rt − gm)² bm(xt),   where   gm = Σt bm(xt) rt / Σt bm(xt)

•If the error is not acceptable, the data reaching node m is split further
such that the sum of the errors in the branches is minimum.
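A minimal illustrative sketch (not the book's code) of scoring a regression node and a candidate split by mean square error, with gm taken as the mean of the responses reaching the node:

def node_error(responses):
    """Mean square error E_m when g_m is the mean response at the node."""
    g = sum(responses) / len(responses)
    return sum((r - g) ** 2 for r in responses) / len(responses)

def split_error(values, responses, threshold):
    """Weighted error of splitting on x <= threshold vs x > threshold."""
    left = [r for x, r in zip(values, responses) if x <= threshold]
    right = [r for x, r in zip(values, responses) if x > threshold]
    n = len(responses)
    return (len(left) * node_error(left) + len(right) * node_error(right)) / n

x = [1, 2, 3, 10, 11, 12]
r = [1.0, 1.1, 0.9, 5.0, 5.2, 4.8]
print(node_error(r), split_error(x, r, 6.5))   # the error drops sharply after the split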



Decision Trees …Types
ID3 (Iterative Dichotomiser 3)
C4.5 (Successor of ID3)
CART (Classification And Regression Tree)
CHAID (CHi-squared Automatic Interaction Detector). Performs
multi-level splits when computing classification trees
MARS: Extends decision trees to handle numerical data better



Decision Trees …ID3 (Iterative
Dichotomiser 3) Algorithm
• The most common decision tree algorithm, introduced in 1986
by Ross Quinlan
• The ID3 algorithm builds decision trees top-down
• It uses a greedy approach
• It uses entropy and information gain to construct the decision
tree



Decision Trees … Entropy
Entropy
•Entropy, also called Shannon entropy and denoted H(S) for a finite set S,
is a measure of the amount of uncertainty or randomness in the data:

    H(S) = Σx∈X − p(x) log2 p(x)

•S — the current (data) set for which entropy is being calculated (it changes at
every iteration of the ID3 algorithm)
•X — the set of classes in S, e.g. X = {yes, no}
•p(x) — the proportion of the number of elements in class x to the number
of elements in set S



Decision Trees … Entropy
•Given a collection S containing positive and negative examples of some
target concept (a binary classification problem with only two classes),
the entropy of S relative to this boolean classification is:

    Entropy(S) = − p+ log2 p+ − p− log2 p−

•Here, p+ and p− are the proportions of positive and negative examples in S.



Decision Trees … Entropy
If all examples are positive or all are negative (all members of S belong
to the same class), then the entropy is 0, i.e., low.
For example, if all examples are positive (p+ = 1), then p− = 0, and
Entropy(S) = −1 · log2(1) − 0 · log2(0) = −1 · 0 − 0 = 0.
If half of the records are of the positive class and half are of the
negative class (the collection contains an equal number of positive
and negative examples), then the entropy is 1, i.e., high.
For example, if half of the records are positive (p+ = 0.5), then p− = 0.5,
and Entropy(S) = −0.5 · log2(0.5) − 0.5 · log2(0.5) = 1.
If the collection contains unequal numbers of positive and
negative examples, the entropy is between 0 and 1.



Decision Trees …Information
Gain
Information gain, also called Kullback-Leibler divergence and denoted
IG(S, A) for a set S, is the effective change in entropy after splitting
on a particular attribute A:

    IG(S, A) = H(S) − Σv∈Values(A) (|Sv| / |S|) · H(Sv)

where Sv is the subset of S for which attribute A has value v.
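A minimal sketch (illustrative, not from the slides) of computing IG(S, A) when the examples are grouped by the values of attribute A; the data below is a hypothetical toy set:

import math

def entropy(labels):
    """Shannon entropy H(S) of a list of class labels."""
    total = len(labels)
    h = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        h -= p * math.log2(p)
    return h

def information_gain(labels, groups):
    """IG(S, A): entropy of S minus the weighted entropy of the groups S_v induced by A."""
    total = len(labels)
    remainder = sum(len(g) / total * entropy(g) for g in groups)
    return entropy(labels) - remainder

S = ['yes', 'yes', 'no', 'no', 'yes']
groups = [['yes', 'yes', 'yes'], ['no', 'no']]   # a perfectly informative split
print(information_gain(S, groups))               # equals H(S) ≈ 0.971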



ID3 Algorithm Example
Consider a piece of data collected from a computer shop, where the
features are age, income, student, and credit rating, and the outcome
variable is whether a customer buys a computer or not. Our job is to
build a predictive model that takes in the above 4 parameters and
predicts whether the customer will buy a computer or not. We will
build a decision tree to do that using the ID3 algorithm.



ID3 Algorithm Example



ID3 Algorithm Example
The ID3 algorithm will perform the following tasks recursively (a minimal code sketch follows this list):
1. Create a root node for the tree
2. If all examples are positive, return the leaf node 'positive'
3. Else, if all examples are negative, return the leaf node 'negative'
4. Calculate the entropy of the current state, H(S)
5. For each attribute x, calculate the entropy with respect to that attribute,
denoted H(S, x)
6. Select the attribute that has the maximum value of IG(S, x)
7. Remove the attribute that offers the highest IG from the set of attributes
8. Repeat until we run out of attributes, or the decision tree consists only of leaf nodes.
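A minimal, hedged Python sketch of the recursion described above, for discrete attributes only; the function names and the tiny dataset at the end are illustrative assumptions, not the algorithm's canonical pseudocode:

import math
from collections import Counter

def entropy(labels):
    total = len(labels)
    return -sum(n / total * math.log2(n / total) for n in Counter(labels).values())

def info_gain(examples, labels, attribute):
    """IG(S, attribute): reduction in entropy from partitioning by the attribute's values."""
    total = len(labels)
    remainder = 0.0
    for value in set(e[attribute] for e in examples):
        subset = [y for e, y in zip(examples, labels) if e[attribute] == value]
        remainder += len(subset) / total * entropy(subset)
    return entropy(labels) - remainder

def id3(examples, labels, attributes):
    # Steps 2-3: a pure node becomes a leaf labeled with the single remaining class.
    if len(set(labels)) == 1:
        return labels[0]
    # Step 8: no attributes left, so label the leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Steps 4-6: pick the attribute with maximum information gain.
    best = max(attributes, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    remaining = [a for a in attributes if a != best]   # step 7
    for value in set(e[best] for e in examples):
        subset = [(e, y) for e, y in zip(examples, labels) if e[best] == value]
        tree[best][value] = id3([e for e, _ in subset], [y for _, y in subset], remaining)
    return tree

# Tiny hypothetical dataset in the spirit of the computer-shop example.
X = [{'age': 'youth', 'student': 'no'}, {'age': 'youth', 'student': 'yes'},
     {'age': 'middle_aged', 'student': 'no'}, {'age': 'senior', 'student': 'no'}]
y = ['no', 'yes', 'yes', 'yes']
print(id3(X, y, ['age', 'student']))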



ID3 Algorithm Example
Step 1 : The initial step is to calculate H(S), the entropy of the
current state. In the above example, we can see that, in total, there
are 5 No's and 9 Yes's.
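Plugging these counts into the entropy formula (a quick check of the arithmetic, assuming the 9 Yes / 5 No totals above):

import math

p_yes, p_no = 9 / 14, 5 / 14
H_S = -p_yes * math.log2(p_yes) - p_no * math.log2(p_no)
print(round(H_S, 3))   # 0.94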



ID3 Algorithm Example
Step 2 : The next step is to calculate H(S, x), the entropy with
respect to attribute x, for each attribute. In the above example,
the expected information needed to classify a tuple in S if the
tuples are partitioned according to age is computed as sketched below.
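A hedged sketch of that computation; the per-value Yes/No counts for age (youth: 2/3, middle-aged: 4/0, senior: 3/2) are assumed from the standard version of this example and should be checked against the table on the earlier slide:

import math

def I(yes, no):
    """Entropy of a branch with the given Yes/No counts."""
    total = yes + no
    h = 0.0
    for n in (yes, no):
        if n:
            h -= n / total * math.log2(n / total)
    return h

# H(S, age) = weighted entropy of the three age branches
H_S_age = (5/14) * I(2, 3) + (4/14) * I(4, 0) + (5/14) * I(3, 2)
gain_age = 0.940 - H_S_age          # IG(S, age) = H(S) - H(S, age)
print(round(H_S_age, 3), round(gain_age, 3))   # ≈ 0.694 and ≈ 0.246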



ID3 Algorithm Example



ID3 Algorithm Example
Step 3 : Choose the attribute with the largest information gain, IG(S, x),
as the decision node, divide the dataset by its branches, and repeat the
same process on every branch.
•Age has the highest information gain among the attributes, so Age
is selected as the splitting attribute.



ID3 Algorithm Example



ID3 Algorithm Example
Step 4a : A branch with entropy of 0 is a leaf node.



ID3 Algorithm Example
Step 4b : A branch with entropy more than 0 needs further splitting.

Step 5 : The ID3 algorithm is run recursively on the non-leaf branches, until all data is classified



Decision Tree to Decision Rules
A decision tree can easily be transformed into a set of rules by writing
down the path from the root node to each leaf node, one rule per leaf.



Decision Tree to Decision Rules
R1 : If (Age=Youth) AND (Student=Yes) THEN Buys_computer=Yes
R2 : If (Age=Youth) AND (Student=No) THEN Buys_computer=No
R3 : If (Age=middle_aged) THEN Buys_computer=Yes
R4 : If (Age=Senior) AND (Credit_rating=Fair) THEN
Buys_computer=No
R5 : If (Age=Senior) AND (Credit_rating =Excellent) THEN
Buys_computer=Yes
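For illustration only, the extracted rules map directly onto a simple if/elif chain (a sketch; the function name is an assumption, and the attribute values follow the rules above):

def buys_computer(age, student, credit_rating):
    """Apply rules R1-R5 extracted from the tree."""
    if age == 'Youth' and student == 'Yes':                 # R1
        return 'Yes'
    elif age == 'Youth' and student == 'No':                # R2
        return 'No'
    elif age == 'middle_aged':                              # R3
        return 'Yes'
    elif age == 'Senior' and credit_rating == 'Fair':       # R4
        return 'No'
    elif age == 'Senior' and credit_rating == 'Excellent':  # R5
        return 'Yes'

print(buys_computer('Youth', 'Yes', 'Fair'))   # 'Yes' by R1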



Advantages
• Allow fast localization of the region covering an input.
• Easy to interpret (can be converted to a set of IF-THEN rules).
• Easy to understand.
• Resistant to outliers, hence require little data preprocessing.
• Require relatively little effort for training.
• Very fast and efficient at prediction compared to k-NN and some other
classification algorithms.



Pruning
•Pruning is a technique in machine learning and search algorithms
that reduces the size of a decision tree by removing sections of
the tree that provide little power to classify instances.
•Pruning reduces the complexity of the final classifier and hence
improves predictive accuracy by reducing overfitting.
•Prepruning stops tree construction early, before the tree is fully grown.
•Postpruning lets the tree grow until it perfectly classifies the training
set, and then prunes the tree back.
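As a sketch of both styles, assuming scikit-learn is available (this is not the lecture's own code): pre-pruning via growth limits, and post-pruning via cost-complexity pruning.

from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Pre-pruning: stop growing early with depth / leaf-size limits.
pre = DecisionTreeClassifier(max_depth=3, min_samples_leaf=5).fit(X_train, y_train)

# Post-pruning: grow fully, then prune back with cost-complexity pruning (ccp_alpha).
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)
alpha = path.ccp_alphas[-2]          # a strong, but not total, amount of pruning
post = DecisionTreeClassifier(ccp_alpha=alpha, random_state=0).fit(X_train, y_train)

print(pre.score(X_test, y_test), post.score(X_test, y_test))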



Pruning… Example



References
This lecture is prepared from the following resources, in addition
to the textbook.
[1] https://medium.com/@kaumadiechamalka100/decision-tree-in-machine-learning-c610ef087260



Next Lecture outline
Linear Discrimination
Introduction
Generalizing the Linear Model
Geometry of the Linear Discriminant
 Two Classes
 Multiple Classes
Pairwise Separation
Parametric Discrimination Revisited
Gradient Descent
