
Decision Tree Learning

Introduction
Decision tree learning is one of the most
widely used and practical methods for
inductive inference
Decision tree learning is a method for
approximating discrete-valued target
functions, in which the learned function is
represented by a decision tree
Decision tree learning is robust to noisy data
and capable of learning disjunctive
expressions
Decision tree representation
Decision trees classify instances by sorting
them down the tree from the root to some
leaf node, which provides the classification of
the instance
Each node in the tree specifies a test of some
attribute of the instance, and each branch
descending from that node corresponds to
one of the possible values for this attribute
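A minimal sketch of this sorting-down process, assuming the tree is stored as a nested dict; the attribute names, the toy tree, and the classify helper below are illustrative choices, not something prescribed by the slides:

```python
def classify(tree, instance):
    """Sort an instance down the tree: at each internal node test one
    attribute and follow the branch for the instance's value of it,
    until a leaf is reached; the leaf is the classification."""
    while isinstance(tree, dict):
        attribute = next(iter(tree))          # the attribute tested at this node
        tree = tree[attribute][instance[attribute]]
    return tree

# A toy two-level tree: test Sky first, then (for cloudy days) Windy
toy_tree = {"Sky": {"Clear": "PlayOutside",
                    "Cloudy": {"Windy": {"Yes": "StayIn", "No": "PlayOutside"}}}}
print(classify(toy_tree, {"Sky": "Cloudy", "Windy": "Yes"}))  # StayIn
```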
Decision Tree Template
Drawn top-to-bottom or left-to-right
Top (or left-most) node = Root Node
Descendent node(s) = Child Node(s)
Bottom (or right-most) node(s) = Leaf Node(s)
Unique path from root to each leaf = Rule

Decision Tree for PlayTennis
When to Consider Decision
Trees
Instances describable by attribute-value pairs
Target function is discrete valued
Disjunctive hypothesis may be required
Possibly noisy training data

Examples (Classification problems):


Equipment or medical diagnosis
Credit risk analysis
Top-Down Induction of
Decision Trees
Entropy (1)
Entropy measures the impurity of a collection of examples S:
Entropy(S) = -p(+)*log2 p(+) - p(-)*log2 p(-)
where p(+) is the proportion of positive examples and p(-) is the
proportion of negative examples in S
If the data are mixed (both classes in similar proportions), entropy is high
If the data are mostly one class, entropy is low
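The binary entropy above is easy to compute directly; a minimal sketch in Python (the function name and example calls are mine; the numbers come from the slides that follow):

```python
import math

def entropy(pos, neg):
    """Entropy of a collection with pos positive and neg negative examples."""
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0          # a pure collection has zero entropy
    p_pos, p_neg = pos / total, neg / total
    return -p_pos * math.log2(p_pos) - p_neg * math.log2(p_neg)

print(round(entropy(9, 5), 3))    # 0.94   (the PlayTennis sample [9+, 5-])
print(round(entropy(29, 35), 4))  # 0.9937 (the [29+, 35-] example below)
```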


Entropy (2)
Information Gain

Information gain measures how well an attribute separates the examples;
the higher the gain, the better the split.


An Example
Consider a sample S = [29+, 35-] split by a boolean attribute A1 into
True = [21+, 5-] and False = [8+, 30-].

The entropy of S is computed as follows:
E(S) = -29/(29+35)*log2(29/(29+35)) - 35/(35+29)*log2(35/(35+29))
     = 0.9937

The entropy of the True branch:
E(TRUE) = -21/(21+5)*log2(21/(21+5)) - 5/(5+21)*log2(5/(5+21))
        = 0.7063

The entropy of the False branch:
E(FALSE) = -8/(8+30)*log2(8/(8+30)) - 30/(30+8)*log2(30/(30+8))
         = 0.7426

Compute Information Gain
Gain(Sample, Attribute) or Gain(S,A) is the expected
reduction in entropy due to sorting S on attribute A:
Gain(S,A) = Entropy(S) - Σ_{v ∈ Values(A)} |Sv|/|S| * Entropy(Sv)
So, for the previous example, the information gain is
calculated as:
Gain(S,A1) = E(S) - (21+5)/(29+35) * E(TRUE) - (8+30)/(29+35) * E(FALSE)
           = 0.9937 - 26/64 * 0.7063 - 38/64 * 0.7426
           = 0.266
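The same calculation as a runnable sketch; entropy is as in the earlier sketch, and the gain helper name is mine:

```python
import math

def entropy(pos, neg):
    total = pos + neg
    if pos == 0 or neg == 0:
        return 0.0
    return -(pos/total) * math.log2(pos/total) - (neg/total) * math.log2(neg/total)

def gain(parent, branches):
    """Entropy of the parent collection minus the size-weighted entropy of
    the branches; parent and each branch are (positive, negative) counts."""
    total = sum(p + n for p, n in branches)
    remainder = sum((p + n) / total * entropy(p, n) for p, n in branches)
    return entropy(*parent) - remainder

# S = [29+, 35-] split by A1 into True = [21+, 5-] and False = [8+, 30-]
print(round(gain((29, 35), [(21, 5), (8, 30)]), 3))  # 0.266
```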
Training Examples
For the target concept PlayTennis

Day Outlook Temperature Humidity Wind PlayTennis

D1 Sunny Hot High Weak No


D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Weak Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Strong Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
To Build a Decision Tree
We want to build a decision tree for the tennis
matches
The schedule of matches depends on the weather
(Outlook, Temperature, Humidity, and Wind)
We apply what we know to build a decision tree
based on this table
Example
Calculating the information gains for each of
the weather attributes (a cross-checking sketch
follows this list):
For the Wind
For the Temperature
For the Humidity
For the Outlook
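The four gains below can be cross-checked by counting directly from the table; a sketch in Python (DATA, ATTRS, and the helper names are mine):

```python
import math
from collections import Counter

# The 14 PlayTennis examples: (Outlook, Temperature, Humidity, Wind, PlayTennis)
DATA = [
    ("Sunny", "Hot", "High", "Weak", "No"),        ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),    ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),     ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Weak", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),    ("Rain", "Mild", "Normal", "Strong", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"),  ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "High", "Strong", "No"),
]
ATTRS = ["Outlook", "Temperature", "Humidity", "Wind"]

def entropy(labels):
    return -sum(c/len(labels) * math.log2(c/len(labels)) for c in Counter(labels).values())

def gain(rows, attr_index):
    """Entropy of the whole set minus the size-weighted entropy of each value's subset."""
    by_value = {}
    for r in rows:
        by_value.setdefault(r[attr_index], []).append(r[-1])
    remainder = sum(len(v)/len(rows) * entropy(v) for v in by_value.values())
    return entropy([r[-1] for r in rows]) - remainder

for i, name in enumerate(ATTRS):
    print(name, round(gain(DATA, i), 3))
# Outlook 0.247, Temperature 0.029, Humidity 0.152, Wind 0.048 (matching the slides up to rounding)
```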
For the Wind
S=[9+,5-]
E = -(9/14)*log2(9/14) - (5/14)*log2(5/14) = 0.940
Wind

Weak Strong

[6+, 2-] [3+, 3-]

E=0.811 E=1.0

Gain(S,Wind):
=0.940 - (8/14)*0.811 - (6/14)*1.0
=0.048
For the Temperature
S=[9+,5-]
E=0.940
Temperature

Hot Mild Cool

[2+, 2-] [4+, 2-] [3+, 1-]


E=1.0 E=0.918 E=0.811

Gain(S,Temperature)
=0.940 - (4/14)*1.0 - (6/14)*0.918 - (4/14)*0.811
=0.029
For the Humidity
S=[9+,5-]
E=0.940
Humidity

High Normal

[3+, 4-] [6+, 1-]


E=0.985 E=0.592

Gain(S,Humidity)
=0.940 - (7/14)*0.985 - (7/14)*0.592
=0.151
For the Outlook
S=[9+,5-]
E=0.940
Outlook

Sunny Overcast Rain

[2+, 3-] [4+, 0-] [3+, 2-]


E=0.971 E=0.0 E=0.971

Gain(S,Outlook)
=0.940 - (5/14)*0.971 - (4/14)*0.0 - (5/14)*0.971
=0.247
Next
Outlook is the winner for the root.
Now that we have discovered the root
of our decision tree, we must
recursively find the nodes that should
go below Sunny, Overcast, and Rain
(a recursive sketch follows).
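A minimal sketch of this recursion (ID3-style), assuming DATA, ATTRS, and the gain(rows, attr_index) helper from the earlier table sketch are in scope; build_tree is my name for it:

```python
from collections import Counter
# Assumes DATA, ATTRS and gain(rows, attr_index) from the earlier table sketch.

def build_tree(rows, attr_indices):
    """Pick the attribute with the highest gain, split on it, and recurse
    until a branch is pure (or no attributes are left)."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1:                    # pure -> leaf
        return labels[0]
    if not attr_indices:                         # nothing left to test -> majority leaf
        return Counter(labels).most_common(1)[0][0]
    best = max(attr_indices, key=lambda i: gain(rows, i))
    subtree = {}
    for value in sorted({r[best] for r in rows}):
        subset = [r for r in rows if r[best] == value]
        subtree[value] = build_tree(subset, [i for i in attr_indices if i != best])
    return {ATTRS[best]: subtree}

print(build_tree(DATA, list(range(len(ATTRS)))))
# {'Outlook': {'Overcast': 'Yes',
#              'Rain':  {'Wind': {'Strong': 'No', 'Weak': 'Yes'}},
#              'Sunny': {'Humidity': {'High': 'No', 'Normal': 'Yes'}}}}
```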
For the Rain ^ Humidity
S=[3+,2-]
E=0.971
Rain

Humidity

High Normal
[1+, 1-] [2+, 1-]
E=1.0 E=0.918

Gain(S,Rain^Humidity)
=0.971 - (2/5)*1.0 - (3/5)*0.918
=0.02
For the Rain ^ Temperature
S=[3+,2-] Rain
E=0.971
Temperature

Hot Mild Cool

[0+, 0-] [2+, 1-] [1+, 1-]


E=0.918 E=1.0

Gain(S,Rain ^ Temperature)
=0.971 - (3/5)*0.918 - (2/5)*1.0
=0.02
For the Rain ^ Wind
S=[3+,2-] Rain
E=0.971
Wind

Weak Strong
[3+, 0-] [0+, 2-]
E=0.0 E=0.0

Gain(S,Rain^Wind)
=0.971 - (3/5)*0.0 - (2/5)*0.0
=0.971
Next
Wind is the winner for the Rain branch.
Overcast has only Yes examples.
Outlook

Sunny Overcast Rain

Yes Wind
[D3,D7,D12,D13]

Strong Weak

No Yes
[D6,D14] [D4,D5,D10]
Then
For the Sunny branch, we consider
Temperature and Humidity.
For the Sunny ^ Temperature
S=[2+,3-] Sunny
E=0.971
Temperature

Hot Mild Cool

[0+, 2-] [1+, 1-] [1+, 0-]


E=0.0 E=1.0 E=0.0

Gain(S,Sunny ^ Temperature)
=0.971 - (2/5)*0.0 - (2/5)*1.0 - (1/5)*0.0
=0.571
For the Sunny ^ Humidity
S=[2+,3-] Sunny
E=0.971
Humidity

High Normal
[0+, 3-] [2+, 0-]
E=0.0 E=0.0

Gain(S,Sunny ^ Humidity)
=0.971 - (3/5)*0.0 - (2/5)*0.0
=0.971
Then
For the Sunny branch, the winner is
Humidity (cross-checked in the sketch below).
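A short sketch confirming the choice, reusing DATA, ATTRS, and gain(rows, attr_index) from the earlier table sketch:

```python
sunny = [r for r in DATA if r[0] == "Sunny"]      # days D1, D2, D8, D9, D11
for i, name in enumerate(ATTRS[1:], start=1):     # Outlook itself is already used
    print(name, round(gain(sunny, i), 3))
# Temperature 0.571, Humidity 0.971, Wind 0.02 -> Humidity wins
```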

Complete tree
Then here is the complete tree:
Outlook

Sunny Overcast Rain

Humidity Yes Wind


[D3,D7,D12,D13]

High Normal Strong Weak

No Yes No Yes

[D1,D2] [D8,D9,D11] [D6,D14] [D4,D5,D10]
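The finished tree maps directly onto a nested dict, which can be fed to the classify helper sketched after the representation slide; the name TREE is mine:

```python
# The complete PlayTennis tree above, written as a nested dict
TREE = {"Outlook": {
    "Sunny":    {"Humidity": {"High": "No", "Normal": "Yes"}},
    "Overcast": "Yes",
    "Rain":     {"Wind": {"Strong": "No", "Weak": "Yes"}},
}}
# e.g. classify(TREE, {"Outlook": "Rain", "Wind": "Strong"}) -> "No"  (day D14)
```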


Decision Tree Induction
Hypothesis Space Search by ID3
Hypothesis space is complete
Target function surely in there
Only outputs a single hypothesis
No backtracking
Local minima
Statistically-based search choices
Robust to noisy data
Inductive bias: prefer shortest tree
C4.5
C4.5 is an algorithm used to generate a decision tree
developed by Ross Quinlan
C4.5 made a number of improvements to ID3. Some
of these are:
Handling both continuous and discrete attributes
Handling training data with missing attribute values
Handling attributes with differing costs
Pruning trees after creation
Quinlan went on to create C5.0 and See5 (C5.0 for
Unix/Linux, See5 for Windows) which he markets
commercially.
The Over-fitting Issue
Over-fitting is caused by creating
decision rules that work accurately on
the training set but are based on an
insufficient quantity of samples.
As a result, these decision rules may
not work well in more general cases.
Overfitting in Decision Trees

Reduced-Error Pruning
Split the data into a training set and a validation set; repeatedly prune the
node whose removal most improves accuracy on the validation set, and stop
when further pruning hurts.
Rule Post-Pruning
Convert tree to equivalent set of rules (sketched below)
Prune each rule by removing any preconditions that result in improving
its estimated accuracy
Sort the pruned rules by their estimated accuracy, and consider them
in this sequence when classifying subsequent instances

Perhaps most frequently used method
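The first step (one rule per root-to-leaf path) can be sketched as follows; pruning would then drop any precondition whose removal improves estimated accuracy. It assumes the nested-dict TREE from the sketch after the complete tree; tree_to_rules is my name:

```python
def tree_to_rules(tree, conditions=()):
    """Return one (preconditions, label) rule per root-to-leaf path."""
    if not isinstance(tree, dict):                       # leaf
        return [(list(conditions), tree)]
    attribute = next(iter(tree))
    rules = []
    for value, subtree in tree[attribute].items():
        rules += tree_to_rules(subtree, conditions + ((attribute, value),))
    return rules

for preconditions, label in tree_to_rules(TREE):
    print(" AND ".join(f"{a}={v}" for a, v in preconditions), "->", label)
# e.g. Outlook=Sunny AND Humidity=High -> No
```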


Continuous Valued Attributes
Create a discrete attribute to test the continuous one

There are two candidate thresholds


The information gain can be computed for
each of the candidate attributes,
Temperature>54 and Temperature>85, and the
best can be selected (Temperature>54)
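The candidate thresholds arise as midpoints where the class label changes along the sorted continuous values. A sketch using example values consistent with the two thresholds named above (the specific numbers 40, 48, 60, 72, 80, 90 and their labels are an assumption of mine):

```python
import math
from collections import Counter

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]   # PlayTennis for each temperature

def entropy(ls):
    return -sum(c/len(ls) * math.log2(c/len(ls)) for c in Counter(ls).values())

# Candidate thresholds: midpoints between adjacent values whose label changes
candidates = [(a + b) / 2 for a, b, la, lb in zip(temps, temps[1:], labels, labels[1:]) if la != lb]
print(candidates)                                   # [54.0, 85.0]

for t in candidates:                                # gain of the boolean test Temperature > t
    above = [l for x, l in zip(temps, labels) if x > t]
    below = [l for x, l in zip(temps, labels) if x <= t]
    g = entropy(labels) - len(above)/len(labels)*entropy(above) - len(below)/len(labels)*entropy(below)
    print(t, round(g, 3))                           # Temperature > 54 has the higher gain
```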
Attributes with many Values
Problems:
If an attribute has many values, Gain will select it
Imagine using the attribute Date. It would have the highest
information gain of any of the attributes, but the resulting
decision tree would not be useful.
Missing Attribute Values
Attributes with Costs
Consider:
Medical diagnosis: BloodTest has a cost of 150 dollars
How can we learn a consistent tree with low
expected cost?
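One published heuristic, attributed to Tan and Schlimmer, divides the squared gain by the attribute's cost so that informative but cheap attributes are preferred. A sketch assuming DATA, ATTRS, and gain(rows, attr_index) from the earlier table sketch; the costs here are made-up numbers:

```python
# Illustrative measurement costs per attribute (invented for this sketch)
COSTS = {"Outlook": 1.0, "Temperature": 5.0, "Humidity": 3.0, "Wind": 1.0}

def cost_sensitive_score(rows, attr_index):
    """Tan & Schlimmer-style heuristic: Gain(S,A)^2 / Cost(A)."""
    g = gain(rows, attr_index)
    return g * g / COSTS[ATTRS[attr_index]]

best = max(range(len(ATTRS)), key=lambda i: cost_sensitive_score(DATA, i))
print(ATTRS[best])   # Outlook still wins with these particular costs
```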
Decision Tree Advantages
1. Easy to understand
2. Map nicely to a set of business rules
3. Applied to real problems
4. Make no prior assumptions about the data
5. Able to process both numerical and categorical data
Decision Tree Disadvantages
1. Output attribute must be categorical
2. Limited to one output attribute
3. Decision tree algorithms are unstable
4. Trees created from numeric datasets can be complex
Conclusion
Decision Tree Learning:
Simple to understand and interpret
Requires little data preparation
Able to handle both numerical and categorical
data
Uses a white box model
Possible to validate a model using statistical
tests
Robust; performs well with large data in a
short time
