
Artificial Intelligence 11.

Decision Tree Learning


Course V231, Department of Computing, Imperial College London, Simon Colton

What to do this Weekend?


If my parents are visiting
We'll go to the cinema
If not
Then, if it's sunny, I'll play tennis
But if it's windy and I'm rich, I'll go shopping
If it's windy and I'm poor, I'll go to the cinema
If it's rainy, I'll stay in
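As an illustration only (not part of the slides), these rules could be written as a small Python function; the function name, attribute names and return strings are assumptions for the sketch:

    # A minimal sketch of the weekend rules above; names are illustrative assumptions.
    def weekend_decision(parents_visiting: bool, weather: str, rich: bool) -> str:
        if parents_visiting:
            return "cinema"
        if weather == "sunny":
            return "tennis"
        if weather == "windy":
            return "shopping" if rich else "cinema"
        if weather == "rainy":
            return "stay in"
        return "undecided"

    print(weekend_decision(parents_visiting=False, weather="sunny", rich=True))  # tennis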

Written as a Decision Tree


[Tree diagram: the root of the tree is at the top, the leaves at the tips]

Using the Decision Tree (No parents on a Sunny Day)

From Decision Trees to Logic


Decision trees can be written as Horn clauses in first order logic
Read from the root to every tip:
"If this and this and this and this, then do this"
In our example:
If no_parents and sunny_day, then play_tennis
no_parents ∧ sunny_day → play_tennis

Decision Tree Learning Overview


A decision tree can be seen as rules for performing a categorisation
E.g., what kind of weekend will this be?
Remember that we're learning from examples
Not turning thought processes into decision trees
We need examples put into categories
We also need attributes for the examples
Attributes describe examples (background knowledge)
Each attribute takes only a finite set of values
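A minimal sketch, assuming Python dicts, of what one categorised example might look like (the field names follow the weekend example used later in these slides):

    # One categorised example: attribute-value pairs plus its category.
    example = {"weather": "sunny", "parents": "yes", "money": "rich"}
    category = "cinema"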

The ID3 Algorithm - Overview


The major question in decision tree learning:
Which nodes to put in which positions
Including the root node and the leaf nodes
ID3 uses a measure called Information Gain
Based on a notion of entropy: impurity in the data
Used to choose which node to put in next
The node with the highest information gain is chosen
When there are no choices, a leaf node is put on

Entropy General Idea

From Tom Mitchell's book:
"In order to define information gain precisely, we begin by defining a measure commonly used in information theory, called entropy, that characterizes the (im)purity of an arbitrary collection of examples."
We want a notion of impurity in data
Imagine a set of boxes with balls in them
If all the balls are in one box, this is nicely ordered, so it scores low for entropy
Boxes with very few examples in score low
Boxes with almost all the examples in score low
Calculate entropy by summing over all the boxes

Entropy - Formulae
Given a set of examples, S
For examples in a binary categorisation:
Entropy(S) = -p+ log2(p+) - p- log2(p-)
Where p+ is the proportion of positives and p- is the proportion of negatives
For examples in categorisations c1 to cn:
Entropy(S) = -p1 log2(p1) - p2 log2(p2) - ... - pn log2(pn)
Where pi is the proportion of examples in category ci
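A sketch of these formulae in Python (my own helper, not from the slides), taking the categorisation as a list of labels and using the 0*log2(0) = 0 convention noted later:

    import math
    from collections import Counter

    def entropy(categories):
        """Entropy of a collection of category labels; 0*log2(0) is taken as 0."""
        total = len(categories)
        proportions = [count / total for count in Counter(categories).values()]
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    print(entropy(["+", "-", "-", "-"]))  # 0.811..., matching the worked example below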

Entropy - Explanation

Each category adds its own term to the whole measure
When pi is near to 1:
(Nearly) all the examples are in this category
So it should score low for its bit of the entropy
log2(pi) gets closer and closer to 0, and this part dominates the calculation
So this term comes to nearly 0 (which is good)
When pi is near to 0:
(Very) few examples are in this category
So it should score low for its bit of the entropy
log2(pi) gets larger (more negative), but it does not dominate: the pi factor shrinks faster, so the term -pi log2(pi) still comes to nearly 0
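A quick numeric check of this behaviour (my own sketch, not from the slides):

    import math

    # Each category contributes -pi * log2(pi) to the entropy.
    for p in (0.999, 0.5, 0.001):
        print(p, -p * math.log2(p))
    # Near 1 the log2 term is close to 0, so the contribution is tiny;
    # near 0 the log2 term grows (more negative) but the p factor wins,
    # so the contribution is tiny again. Mixed proportions score highest.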

Information Gain
Given a set of examples S and an attribute A:
A can take values v1 ... vm
Let Sv = {examples which take value v for attribute A}
Calculate Gain(S, A):
Gain(S, A) = Entropy(S) - sum over the values v of A of (|Sv|/|S|) * Entropy(Sv)
This estimates the reduction in entropy we get if we know the value of attribute A for the examples in S
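A sketch of Gain(S, A) in Python, assuming examples are dicts with a "category" field as in the entropy sketch above (function and key names are my own):

    import math
    from collections import Counter

    def entropy(categories):
        total = len(categories)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(categories).values())

    def information_gain(examples, attribute):
        """Gain(S, A) = Entropy(S) - sum over v of (|Sv|/|S|) * Entropy(Sv)."""
        all_categories = [e["category"] for e in examples]
        gain = entropy(all_categories)
        for v in set(e[attribute] for e in examples):
            sv = [e["category"] for e in examples if e[attribute] == v]
            gain -= (len(sv) / len(examples)) * entropy(sv)
        return gain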

An Example Calculation of Information Gain


Suppose we have a set of examples
S = {s1, s2, s3, s4}
In a binary categorisation
With one positive example and three negative examples
The positive example is s1
And attribute A
Which takes values v1, v2, v3
s1 takes value v2 for A, s2 takes value v2 for A
s3 takes value v3 for A, s4 takes value v1 for A

First Calculate Entropy(S)


Recall that
Entropy(S) = -p+ log2(p+) - p- log2(p-)
From the binary categorisation, we know that
p+ = 1/4 and p- = 3/4
Hence
Entropy(S) = -(1/4) log2(1/4) - (3/4) log2(3/4) = 0.811
Note for users of old calculators:
You may need to use the fact that log2(x) = ln(x)/ln(2)
And also note that, by convention:
0 * log2(0) is taken to be 0

Calculate Gain for each Value of A


Remember that
Gain(S, A) = Entropy(S) - sum over the values v of A of (|Sv|/|S|) * Entropy(Sv)
And that Sv = {set of examples with value v for A}
So, Sv1 = {s4}, Sv2 = {s1, s2}, Sv3 = {s3}
Now,
(|Sv1|/|S|) * Entropy(Sv1) = (1/4) * (-(0/1)*log2(0/1) - (1/1)*log2(1/1)) = (1/4) * (0 - (1)*log2(1)) = (1/4)*(0 - 0) = 0
Similarly, (|Sv2|/|S|) * Entropy(Sv2) = 0.5 and (|Sv3|/|S|) * Entropy(Sv3) = 0

Final Calculation

So, we add up the three calculations and take them away from the overall entropy of S to get the final answer for information gain:
Gain(S, A) = 0.811 - (0 + 1/2 + 0) = 0.311
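The same arithmetic, checked in a standalone Python snippet (my own sketch of this slide's numbers):

    import math

    # S = {s1 (+), s2 (-), s3 (-), s4 (-)}; A takes values v2, v2, v3, v1 respectively.
    entropy_S = -(1/4) * math.log2(1/4) - (3/4) * math.log2(3/4)   # 0.811
    weighted_sum = (1/4) * 0.0 + (2/4) * 1.0 + (1/4) * 0.0          # terms for Sv1, Sv2, Sv3
    print(round(entropy_S - weighted_sum, 3))                        # 0.311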

The ID3 Algorithm


Given a set of examples, S
Described by a set of attributes Ai
Categorised into categories cj
1. Choose the root node to be attribute A
Such that A scores highest for information gain
Relative to S, i.e., Gain(S, A) is the highest over all attributes
2. For each value v that A can take
Draw a branch and label it with the corresponding v
Then see the options on the next slide!

The ID3 Algorithm

For each branch you've just drawn (for value v):
If Sv only contains examples in category c
Then put that category as a leaf node in the tree
If Sv is empty
Then find the default category (which contains the most examples from S)
Put this default category as a leaf node in the tree
Otherwise
Remove A from the attributes which can be put into nodes
Replace S with Sv
Find the new attribute A scoring best for Gain(S, A)
Start again at part 2
Make sure you replace S with Sv
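Putting the two slides together, a minimal Python sketch of ID3 might look like the following. The nested-dict tree representation and the handling of exhausted attributes are my own assumptions, branches are only drawn for values that actually occur in the examples, and the entropy and gain helpers are repeated so the sketch runs on its own:

    import math
    from collections import Counter

    def entropy(categories):
        total = len(categories)
        return -sum((c / total) * math.log2(c / total)
                    for c in Counter(categories).values())

    def gain(examples, attribute):
        g = entropy([e["category"] for e in examples])
        for v in set(e[attribute] for e in examples):
            sv = [e["category"] for e in examples if e[attribute] == v]
            g -= (len(sv) / len(examples)) * entropy(sv)
        return g

    def id3(examples, attributes, default=None):
        categories = [e["category"] for e in examples]
        if not examples:                        # Sv is empty: use the default category
            return default
        if len(set(categories)) == 1:           # single category: put it as a leaf node
            return categories[0]
        if not attributes:                      # no attributes left (my assumption): majority vote
            return Counter(categories).most_common(1)[0][0]
        best = max(attributes, key=lambda a: gain(examples, a))   # highest information gain
        default = Counter(categories).most_common(1)[0][0]        # most common category in S
        tree = {best: {}}
        for v in set(e[best] for e in examples):
            sv = [e for e in examples if e[best] == v]            # replace S with Sv
            remaining = [a for a in attributes if a != best]      # remove A from attributes
            tree[best][v] = id3(sv, remaining, default)
        return tree

With the dataset from the worked example that follows, id3(weekends, ["weather", "parents", "money"]) should reproduce the splits derived below: weather at the root, then parents on the sunny branch.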

Explanatory Diagram

A Worked Example
Weekend   Weather   Parents   Money   Decision (Category)
W1        Sunny     Yes       Rich    Cinema
W2        Sunny     No        Rich    Tennis
W3        Windy     Yes       Rich    Cinema
W4        Rainy     Yes       Poor    Cinema
W5        Rainy     No        Rich    Stay in
W6        Rainy     Yes       Poor    Cinema
W7        Windy     No        Poor    Cinema
W8        Windy     No        Rich    Shopping
W9        Windy     Yes       Rich    Cinema
W10       Sunny     No        Rich    Tennis
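This table might be encoded for the sketches above as a list of dicts (the key names are my own choice):

    weekends = [
        {"weekend": "W1",  "weather": "Sunny", "parents": "Yes", "money": "Rich", "category": "Cinema"},
        {"weekend": "W2",  "weather": "Sunny", "parents": "No",  "money": "Rich", "category": "Tennis"},
        {"weekend": "W3",  "weather": "Windy", "parents": "Yes", "money": "Rich", "category": "Cinema"},
        {"weekend": "W4",  "weather": "Rainy", "parents": "Yes", "money": "Poor", "category": "Cinema"},
        {"weekend": "W5",  "weather": "Rainy", "parents": "No",  "money": "Rich", "category": "Stay in"},
        {"weekend": "W6",  "weather": "Rainy", "parents": "Yes", "money": "Poor", "category": "Cinema"},
        {"weekend": "W7",  "weather": "Windy", "parents": "No",  "money": "Poor", "category": "Cinema"},
        {"weekend": "W8",  "weather": "Windy", "parents": "No",  "money": "Rich", "category": "Shopping"},
        {"weekend": "W9",  "weather": "Windy", "parents": "Yes", "money": "Rich", "category": "Cinema"},
        {"weekend": "W10", "weather": "Sunny", "parents": "No",  "money": "Rich", "category": "Tennis"},
    ]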

Information Gain for All of S


S = {W1, W2, ..., W10}
Firstly, we need to calculate:
Entropy(S) = -(6/10)log2(6/10) - (2/10)log2(2/10) - (1/10)log2(1/10) - (1/10)log2(1/10) = 1.571 (see notes)
Next, we need to calculate the information gain for all the attributes we currently have available (which is all of them at the moment)
Gain(S, weather) = 0.7
Gain(S, parents) = 0.61
Gain(S, money) = 0.2816
Hence, the weather is the first attribute to split on
Because this gives us the biggest information gain
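A standalone check of these numbers (my own sketch, using the category counts read off the table: 6 Cinema, 2 Tennis, 1 Shopping, 1 Stay in):

    import math

    def h(proportions):
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    entropy_S = h([6/10, 2/10, 1/10, 1/10])
    # Weather splits S into Sunny {C,T,T}, Windy {C,C,Sh,C} and Rainy {C,St,C}.
    gain_weather = entropy_S - (0.3 * h([1/3, 2/3]) + 0.4 * h([3/4, 1/4]) + 0.3 * h([1/3, 2/3]))
    print(round(entropy_S, 3), round(gain_weather, 2))   # 1.571 0.7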

Top of the Tree

So, this is the top of our tree:
[Tree diagram: weather at the root, with branches for sunny, windy and rainy]
Now, we look at each branch in turn
In particular, we look at the examples with the attribute value prescribed by the branch
Ssunny = {W1, W2, W10}
The categorisations are cinema, tennis and tennis for W1, W2 and W10
What does the algorithm say?
The set is neither empty, nor a single category
So, following the algorithm, we replace S with Ssunny and carry on (next slide)

Working with Ssunny


Weekend   Weather   Parents   Money   Decision
W1        Sunny     Yes       Rich    Cinema
W2        Sunny     No        Rich    Tennis
W10       Sunny     No        Rich    Tennis

We need to choose a new attribute to split on
It cannot be weather, of course: we've already had that
So, calculate information gain again:
Gain(Ssunny, parents) = 0.918
Gain(Ssunny, money) = 0
Hence we choose to split on parents
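Checking these two numbers directly (my own sketch; Ssunny has one Cinema and two Tennis examples, all Rich):

    import math

    def h(proportions):
        return -sum(p * math.log2(p) for p in proportions if p > 0)

    entropy_sunny = h([1/3, 2/3])                                         # 0.918
    gain_parents = entropy_sunny - ((1/3) * h([1.0]) + (2/3) * h([1.0]))  # both subsets pure
    gain_money = entropy_sunny - (3/3) * entropy_sunny                    # all three are Rich
    print(round(gain_parents, 3), round(gain_money, 3))                   # 0.918 0.0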

Getting to the leaf nodes

If it's sunny and the parents have turned up
Then, looking at the table on the previous slide
There's only one answer: go to cinema
If it's sunny and the parents haven't turned up
Then, again, there's only one answer: play tennis
Hence our decision tree looks like this:
[Tree diagram: weather at the root; the sunny branch leads to parents, with parents = yes giving cinema and parents = no giving tennis; the windy and rainy branches remain to be expanded]
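In the nested-dict form assumed in the earlier ID3 sketch, the tree built so far would be (the windy and rainy branch values are placeholders, since the slides stop at the sunny branch):

    tree_so_far = {
        "weather": {
            "Sunny": {"parents": {"Yes": "Cinema", "No": "Tennis"}},
            "Windy": "(still to be expanded)",
            "Rainy": "(still to be expanded)",
        }
    }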

Avoiding Overfitting
Decision trees can be learned to perfectly fit the data given
This is probably overfitting
The answer is a memorisation, rather than a generalisation
Avoidance method 1:
Stop growing the tree before it reaches perfection
Avoidance method 2:
Grow the tree to perfection, then prune it back afterwards
This is the more useful of the two methods in practice

Appropriate Problems for Decision Tree learning


From Tom Mitchell's book:
Background concepts describe examples in terms of attribute-value pairs; values are always finite in number
The concept to be learned (target function) has discrete values
Disjunctive descriptions might be required in the answer
Decision tree algorithms are fairly robust to errors
In the actual classifications
In the attribute-value pairs
In missing information
