
Lecture No.

Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University

Date: 12.3.2008
Today's Agenda

Recap of ID3 Algorithm


Machine Learning Bias
Occam's razor principle
Handling ID3 problems
Decision Trees

Decision tree learning is a method for approximating discrete-valued
target functions, in which the learned function is represented by a
decision tree.

Decision trees can also be represented as sets of if-then rules.

Decision tree learning is one of the most widely used approaches for
inductive inference.
Decision Trees
[Figure: a generic decision tree. Intermediate nodes test attributes (A1, A2, A3),
the edges leaving a node are labelled with that attribute's values, and the
leaf nodes hold output values.]
Decision Trees Representation

[Figure: each path from the root to a leaf corresponds to a conjunction of
attribute tests, and the tree as a whole corresponds to a disjunction of
these conjunctions.]

Decision Trees as If-then-else Rules

Each rule below is a conjunction of attribute tests; together the rules form
a disjunction:

If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes


If (Outlook = Overcast) then PlayTennis = Yes
If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
Problems Suitable for Decision Trees

Instances are represented by attribute-value pairs

The target function has discrete output values

Disjunctive descriptions may be required

The training data may contain errors

The training data may contain missing attribute values


Building Decision Tree

[Figure: a decision tree being built, with attribute A1 at the root, a branch
for each of its attribute values, subtrees testing attributes A2 and A3, and
leaves holding output values.]
Building Decision Tree

The candidate attributes are Outlook, Temperature, Humidity, and Wind.
Which attribute should be selected for the root node?
Entropy

Given a collection S, containing positive and negative examples of some
target concept, the entropy of S relative to this boolean classification
(yes/no) is

    Entropy(S) = -p+ log2 p+ - p- log2 p-

where p+ is the proportion of positive examples in S and p- is the
proportion of negative examples in S. In all calculations involving
entropy we define 0 log 0 to be 0.
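A minimal Python sketch of this calculation (the function name and structure are illustrative, not from the lecture):

```python
import math

def entropy(num_pos, num_neg):
    """Entropy of a boolean-labelled collection, with 0 log 0 taken as 0."""
    total = num_pos + num_neg
    if total == 0:
        return 0.0
    result = 0.0
    for count in (num_pos, num_neg):
        p = count / total
        if p > 0:                      # convention: 0 log 0 = 0
            result -= p * math.log2(p)
    return result

# Example: 9 positive and 5 negative examples
print(entropy(9, 5))                   # about 0.940
```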
Information Gain Measure

Information gain is simply the expected reduction in entropy caused by
partitioning the examples according to an attribute.

More precisely, the information gain Gain(S, A) of an attribute A, relative
to a collection of examples S, is defined as

    Gain(S, A) = Entropy(S) - sum over v in Values(A) of (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is
the subset of S for which attribute A has value v, i.e., Sv = {s in S | A(s) = v}.
Information Gain Measure

[In the formula above, the first term is the entropy of S and the second
term is the expected entropy of S after the partition on A.]

Gain(S, A) is the expected reduction in entropy caused by knowing the value of
attribute A.

Gain(S, A) is the information provided about the target function value, given the
value of some other attribute A. The value of Gain(S, A) is the number of bits
saved when encoding the target value of an arbitrary member of S, by knowing
the value of attribute A.
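A minimal Python sketch of Gain(S, A), assuming (purely for illustration) that examples are dictionaries mapping attribute names to values and that the target attribute is called "PlayTennis":

```python
import math
from collections import Counter, defaultdict

def entropy_of(labels):
    """Entropy of a list of class labels, with 0 log 0 taken as 0."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def info_gain(examples, attribute, target="PlayTennis"):
    """Gain(S, A) = Entropy(S) - sum_v (|Sv| / |S|) * Entropy(Sv)."""
    labels = [ex[target] for ex in examples]
    subsets = defaultdict(list)
    for ex in examples:
        subsets[ex[attribute]].append(ex[target])
    expected_after = sum(len(sub) / len(examples) * entropy_of(sub)
                         for sub in subsets.values())
    return entropy_of(labels) - expected_after
```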
Example

There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is

    Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940

Gain(S, Wind) is then computed in the same way, using the partition of S
induced by the values of the Wind attribute.
Final Decision Tree
Some Insights into Capabilities and
Limitations of ID3 Algorithm
ID3 searches a complete hypothesis space. [Advantage]

ID3 maintains only a single current hypothesis as it searches through
the space of decision trees. By committing to a single hypothesis, ID3
loses the capabilities that follow from explicitly representing all consistent
hypotheses. [Disadvantage]

ID3 in its pure form performs no backtracking in its search. Once it


selects an attribute to test at a particular level in the tree, it never backtracks
to reconsider this choice. Therefore, it is susceptible to the usual risks of
hill-climbing search without backtracking: converging to locally
optimal solutions that are not globally optimal. [Disadvantage]
Some Insights into Capabilities and
Limitations of ID3 Algorithm

ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. This contrasts with methods that make decisions
incrementally, based on individual training examples (e.g., FIND-S or
CANDIDATE-ELIMINATION). One advantage of using statistical
properties of all the examples (e.g., information gain) is that the
resulting search is much less sensitive to errors in individual training
examples. [Advantage]
Machine Learning Biases

Language Bias / Restriction Bias: a restriction on the type of hypotheses
that can be learned (limits the set of hypotheses that can be expressed).

Preference Bias / Search Bias: a preference for certain hypotheses over
others (e.g., shorter hypotheses), with no hard restriction on the
hypothesis space.
CANDIDATE-ELIMINATION Algorithm
CANDIDATE-ELIMINATION Algorithm

The hypothesis was assumed to be a conjunction of attributes.


CANDIDATE-ELIMINATION Algorithm

The CANDIDATE-ELIMINATION algorithm is language-biased.


CANDIDATE-ELIMINATION Algorithm

The problem is that the algorithm is biased: it considers only the space of conjunctive hypotheses.

The following example requires a more expressive hypothesis space


Building Decision Tree

[Figure: the same generic decision tree schematic as before, with attribute A1
at the root, a branch for each attribute value, subtrees testing A2 and A3,
and leaves holding output values.]
Decision Tree

The ID3 algorithm has a preference/search bias.


ID3 Strategy for Selecting Hypothesis

Selects trees that place the attributes with highest


information gain closest to the root.

Selects in favor of shorter trees over longer ones.


Preference Bias or Restriction Bias ?

A preference bias is more desirable than a restriction bias,


because it allows the learner to work within a complete
hypothesis space that is assured to contain the unknown
target function.

In contrast, a restriction bias that strictly limits the set of


potential hypotheses is generally less desirable, because it
introduces the possibility of excluding the unknown target
function altogether.
Preference Bias or Restriction Bias ?

While ID3 exhibits a purely preference bias and CANDIDATE-ELIMINATION a
purely restriction bias, some learning systems combine both.
Preference Bias AND Restriction Bias ?
Preference Bias AND Restriction Bias ?

Task T: playing checkers


Performance measure P: % of games won in the world
tournament
Training experience E: games played against itself
Target function: F : Board → R
Target function representation:
F'(b) = w0 + w1x1 + w2x2 + w3x3 + w4x4 + w5x5 + w6x6

A linear combination of variables


(Language Bias/Restriction Bias)
Preference Bias AND Restriction Bias ?

E (error) = sum over training examples <b, Ftrain(b)> of (Ftrain(b) - F'(b))^2

Preference bias (because the weights are found using the Least Mean
Squares technique).
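As a rough illustration of where this preference bias comes from, here is a sketch of one LMS-style weight update for the linear evaluation function F'(b); the learning rate and variable names are assumptions for illustration, not values from the lecture:

```python
def lms_update(weights, features, training_value, learning_rate=0.1):
    """One LMS step for F'(b) = w0 + w1*x1 + ... + w6*x6.

    `features` is (1.0, x1, ..., x6), so the first weight acts as w0.
    Each weight is nudged to reduce the squared error (Ftrain(b) - F'(b))^2.
    """
    predicted = sum(w * x for w, x in zip(weights, features))
    error = training_value - predicted
    return [w + learning_rate * error * x for w, x in zip(weights, features)]
```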
Issues in Decision Tree Learning

Determining how deeply to grow the decision tree


Handling continuous attributes
Choosing an appropriate attribute selection measure
Handling training data with missing attribute values
Handling attributes with differing costs, and improving
computational efficiency
Occam's Razor

Occam's razor (sometimes spelled Ockham's razor) is a principle attributed
to the 14th-century English logician and Franciscan friar William of Ockham.

The principle states that the explanation of any phenomenon should make as
few assumptions as possible, eliminating those that make no difference in
the observable predictions of the explanatory hypothesis or theory.
Occam's Razor

This is often paraphrased as "All other things being equal, the simplest
solution is the best."

In other words, when multiple competing theories are equal in other


respects, the principle recommends selecting the theory that introduces
the fewest assumptions and postulates the fewest entities. It is in this
sense that Occam's razor is usually understood.

Prefer the simplest hypothesis that fits the data


Why it's called Occam's Razor

Tom M. Mitchell says: Occam got this idea while shaving.

Wikipedia says: the term "razor" refers to the act of shaving away
unnecessary assumptions to get to the simplest explanation.
ID3 Strategy for Selecting Hypothesis

Selects trees that place the attributes with highest


information gain closest to the root.

Selects in favor of shorter trees over longer ones.


Problem with Occam's Razor

Why should the simplest hypothesis that fits the data be the best solution?
Why not the second or third simplest hypothesis?

The size of a hypothesis is determined by the particular


representation used internally by the learner. Two learners using
different internal representations could therefore arrive at different
hypotheses, both justifying their contradictory conclusions by
Occam's razor!
Training and Testing

For classification problems, a classifier's performance is measured in
terms of the error rate.

The classifier predicts the class of each instance: if it is correct,


that is counted as a success; if not, it is an error.

The error rate is just the proportion of errors made over a whole
set of instances, and it measures the overall performance of the
classifier.
Training and Testing

What we are interested in is the likely future performance on new data,
not the past performance on old data. We already know the classification
of each instance in the training set, which after all is why we can use it
for training.

We are not generally interested in learning about those classifications,
although we might be if our purpose is data cleansing rather than
prediction.

So the question is: is the error rate on old data likely to be a good
indicator of the error rate on new data?
The answer is a resounding no, not if the old data was used during the
learning process to train the classifier.
Training and Testing

Error rate on the training set is not likely to be a good


indicator of future performance.
Training and Testing

Self-consistency test: the training dataset and the test dataset are the same.

The error rate on the training data is called the resubstitution error,
because it is calculated by resubstituting the training instances into a
classifier that was constructed from them.
Training and Testing
Holdout strategy: the holdout method reserves a certain amount of the data
for testing and uses the remainder for training (and sets part of that aside
for validation, if required).

In practice we have only a limited number of examples available.
Training and Testing

K-fold Cross validation technique:

In k-fold cross-validation, the dataset is partitioned randomly into k
equal-sized sets. Training and testing are carried out k times, each time
using one distinct set for testing and the other k-1 sets for training.
4-Fold Cross-validation

[Figures: four passes over the data. In each pass a different quarter of the
dataset serves as the test set, the remaining three quarters form the
training set, and the resulting accuracy is recorded as ACC1, ACC2, ACC3,
or ACC4.]
4-Fold Cross-validation

ACC = (ACC1 + ACC2 + ACC3 + ACC4) / 4
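A minimal Python sketch of this procedure; `train_and_evaluate` is a hypothetical caller-supplied function that trains a classifier on the training folds and returns its accuracy on the test fold:

```python
import random

def k_fold_cross_validation(examples, k, train_and_evaluate, seed=None):
    """Average accuracy over k folds: each fold serves once as the test set
    while the remaining k-1 folds together form the training set."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)          # random partition
    folds = [shuffled[i::k] for i in range(k)]     # k roughly equal folds
    accuracies = []
    for i, test_set in enumerate(folds):
        train_set = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        accuracies.append(train_and_evaluate(train_set, test_set))
    return sum(accuracies) / k                     # e.g. (ACC1 + ... + ACCk) / k
```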


Issues in Decision Tree Learning

Determining how deeply to grow the decision tree


Handling continuous attributes
Choosing an appropriate attribute selection measure
Handling training data with missing attribute values
Handling attributes with differing costs, and improving
computational efficiency
Avoiding Overfitting in Decision Trees

A hypothesis is said to be over-fitting the training


examples if some other hypothesis that fits the
training examples less well actually performs better
over the entire distribution of instances (i.e., including
instances beyond the training set).
Overfitting

[Figures: two hypotheses h1 and h2 in the hypothesis space H, fit to a set of
positive and negative examples. h1 is more accurate than h2 on the training
examples, but h1 is less accurate than h2 on the unseen (test) examples.]
Overfitting

Is h1 more accurate than h2 on the training examples?

Yes, and h1 is also more accurate on the test examples: no over-fitting.
Yes, but h1 is less accurate on the test examples: over-fitting (h1 over-fits).
No, and h1 is also less accurate on the test examples: no over-fitting.
No, but h1 is more accurate on the test examples: over-fitting (h2 over-fits).
Overfitting

Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically. However,
when measured over a set of test examples independent of the training examples, accuracy
first increases, then decreases.
Why Overfitting Happens in
Decision Tree Learning?

Presence of errors in the training examples (a cause of over-fitting in
machine learning generally).

When small numbers of examples are associated with leaf nodes.
Presence of Error and Over-fitting

[Figures: with an erroneous training example added, the learned tree becomes
more complex and its depth increases.]
How to avoid Overfitting

Stop growing the tree earlier, before it


reaches the point where it perfectly
classifies the training data

Allow the tree to overfit the data, and then


post-prune the tree.
How to avoid Overfitting

Post-pruning overfit trees has been found


to be more successful in practice. This is
due to the difficulty in the first approach of
estimating precisely when to stop growing
the tree.
How to avoid Overfitting

Regardless of whether the correct tree size


is found by stopping early or by post-
pruning, a key question is what criterion is
to be used to determine the correct final
tree size.
Determining correct final tree size
Use a separate set of examples, distinct from the training examples, to
evaluate the utility of pruning. [Training and validation sets]
<used by the pruning methods below>

Use all the available data for training, but apply a statistical test (e.g.,
a chi-square test) to estimate whether expanding (or pruning) a particular
node is likely to produce an improvement beyond the training set.
<also usable for pruning>

Use an explicit measure of the complexity of encoding the training examples
and the decision tree, halting growth of the tree when this encoding size is
minimized. This approach is based on a heuristic called the Minimum
Description Length (MDL) principle.
Pruning Methods

Reduced-error pruning (Quinlan 1987)

Rule post-pruning (Quinlan 1993)


Reduced Error Pruning

Pruning a decision node consists of removing the


subtree rooted at that node, making it a leaf node,
and assigning it the most common classification of
the training examples affiliated with that node.

Nodes are removed only if the resulting pruned tree performs no worse than
the original over the validation set.
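A sketch of this procedure in Python. The decision-tree representation is left abstract here, so `evaluate`, `internal_nodes`, and `prune` are assumed helper functions supplied by the caller rather than anything defined in the lecture:

```python
def reduced_error_pruning(tree, validation_set, evaluate, internal_nodes, prune):
    """Greedily prune decision nodes while validation accuracy does not drop.

    evaluate(tree, validation_set) -> accuracy on the validation set
    internal_nodes(tree)           -> candidate decision nodes to consider
    prune(tree, node)              -> copy of tree with `node` replaced by a leaf
                                      labelled with its most common classification
    """
    best_accuracy = evaluate(tree, validation_set)
    improved = True
    while improved:
        improved = False
        for node in internal_nodes(tree):
            candidate = prune(tree, node)
            accuracy = evaluate(candidate, validation_set)
            if accuracy >= best_accuracy:   # performs no worse than the original
                tree, best_accuracy, improved = candidate, accuracy, True
                break
    return tree
```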
Reduced Error Pruning

[Figures illustrating the effect of reduced-error pruning on a decision tree
and its accuracy.]
Drawback of Training and
Validation Method

Using a separate set of data to guide pruning is an effective approach
provided a large amount of data is available. The major drawback of this
approach arises when data is limited: holding data out for validation
reduces the number of examples available for training.
Rule Post-Pruning

In practice, rule post-pruning is one quite successful method for finding
high-accuracy hypotheses when post-pruning decision trees.
Rule Post-Pruning (Step 1)

Step 1: Infer the decision tree from the training data, growing the tree
until the training data are fit as well as possible (allowing overfitting
to occur).
Rule Post-Pruning (Step 2)

Step 2: Convert the learned tree into an equivalent set of rules, one rule
for each path from the root node to a leaf node:

1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No


2: IF (Outlook = sunny and Temperature = Cold) THEN PlayTennis = Yes
3: IF (Outlook = sunny and Temperature = Mild and Humidity=High) THEN PlayTennis = No
4: IF (Outlook = sunny and Temperature = Mild and Humidity=Normal) THEN PlayTennis = Yes
5: IF (Outlook = overcast) THEN PlayTennis = Yes
6: IF (Outlook = rain and Wind = Strong) THEN PlayTennis = No
7: IF (Outlook = rain and Wind = Weak) THEN PlayTennis = Yes
Rule Post-Pruning (Step 3)

Step 3: Prune each rule by removing any precondition whose removal does not
worsen the rule's estimated accuracy. For rule 1,

IF (Outlook = sunny AND Temperature = Hot) THEN PlayTennis = No

the candidates are the original rule and its two pruned versions:

IF (Outlook = sunny AND Temperature = Hot) THEN PlayTennis = No (accuracy Acc1)
IF (Outlook = sunny) THEN PlayTennis = No (accuracy Acc2)
IF (Temperature = Hot) THEN PlayTennis = No (accuracy Acc3)

where each accuracy is measured over the test dataset (validation examples).

If Acc3 > Acc2 and Acc3 > Acc1, then rule 1 is replaced by

IF (Temperature = Hot) THEN PlayTennis = No


Rule Post-Pruning (Step 4)

Step 4: Sort the pruned rules R1, ..., R14 in descending order of their
accuracy on the test dataset (validation examples), giving the sorted rules
S1, ..., S14 with

Acc(S1) >= Acc(S2) >= Acc(S3) >= Acc(S4) >= ... >= Acc(S11) >= Acc(S12) >= Acc(S13) >= Acc(S14)

and consider them in this order when classifying subsequent instances.
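A minimal Python sketch of steps 3 and 4; rules are represented here as (preconditions, conclusion) pairs, and `rule_accuracy` is an assumed helper that measures a rule's accuracy over the validation examples (neither the representation nor the helper comes from the lecture):

```python
def prune_rule(preconditions, conclusion, validation_set, rule_accuracy):
    """Step 3: greedily drop preconditions while validation accuracy does not drop."""
    best = list(preconditions)
    best_acc = rule_accuracy(best, conclusion, validation_set)
    improved = True
    while improved and best:
        improved = False
        for i in range(len(best)):
            candidate = best[:i] + best[i + 1:]      # rule with one precondition removed
            acc = rule_accuracy(candidate, conclusion, validation_set)
            if acc >= best_acc:
                best, best_acc, improved = candidate, acc, True
                break
    return best, best_acc

def sort_rules(pruned_rules_with_accuracy):
    """Step 4: sort (rule, accuracy) pairs in descending order of accuracy."""
    return sorted(pruned_rules_with_accuracy, key=lambda pair: pair[1], reverse=True)
```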
Handling Continuous-Valued Attributes

We dynamically define new discrete-valued attributes that partition the
continuous attribute values into a discrete set of intervals (for example,
a boolean attribute that is true when A > c for some threshold c).
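One common way to pick such a threshold, sketched in Python: candidate thresholds are the midpoints between adjacent sorted values whose class labels differ, and the winning threshold is the one whose derived boolean attribute (A > c) yields the highest information gain. The example representation and target name are assumptions carried over from the earlier sketches:

```python
def candidate_thresholds(examples, attribute, target="PlayTennis"):
    """Midpoints between adjacent sorted values of `attribute` whose labels differ."""
    ordered = sorted(examples, key=lambda ex: ex[attribute])
    thresholds = []
    for a, b in zip(ordered, ordered[1:]):
        if a[target] != b[target] and a[attribute] != b[attribute]:
            thresholds.append((a[attribute] + b[attribute]) / 2)
    return thresholds

def discretize(examples, attribute, threshold):
    """Derive a boolean attribute 'A>c' from the continuous attribute."""
    name = f"{attribute}>{threshold}"
    return [{**ex, name: ex[attribute] > threshold} for ex in examples], name
```

The caller would then evaluate each candidate threshold with the earlier `info_gain` sketch and keep the best one.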
Alternative Measures for Selecting
Attributes
There is a natural bias in the information gain measure that favors
attributes with many values over those with few values.

Consider the attribute Date, which has a very large number of possible
values (e.g., March 11, 2008).

If we were to add this as an attribute to the data, it would have the
highest information gain of any of the attributes, because Date alone
perfectly predicts the target attribute over the training data. Thus it
would be selected as the decision attribute for the root node of the tree
and lead to a (quite broad) tree of depth one, which perfectly classifies
the training data.

However, this decision tree would fare poorly on subsequent examples,
because it is not a useful predictor despite the fact that it perfectly
separates the training data.
Alternative Measures for Selecting
Attributes

What is wrong with the attribute Date?


It has so many possible values that it is bound to separate the
training examples into very small subsets. Because of this, it will
have a very high information gain relative to the training
examples, despite being a very poor predictor of the target
function over unseen instances.

One way to avoid this difficulty is to select decision attributes based


on some measure other than information gain. One alternative
measure that has been used successfully is the gain ratio (Quinlan
1986). The gain ratio measure penalizes attributes such as Date by
incorporating a term, called split information, that is sensitive to how
broadly and uniformly the attribute splits the data.
Alternative Measures for Selecting
Attributes

The gain ratio measure is defined as

    GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)

where

    SplitInformation(S, A) = - sum over i = 1..c of (|Si| / |S|) log2(|Si| / |S|)

and S1 through Sc are the c subsets of examples resulting from partitioning
S by the c-valued attribute A.

SplitInformation is actually the entropy of S with respect to the values of
attribute A. This is in contrast to our previous uses of entropy, in which we
considered only the entropy of S with respect to the target attribute whose
value is to be predicted by the learned tree.
Alternative Measures for Selecting
Attributes

The SplitInformation term discourages the selection of attributes with
many uniformly distributed values.

For example, consider a collection of n examples that are completely
separated by attribute A (e.g., Date). In this case the SplitInformation
value will be log2 n. In contrast, a boolean attribute B that splits the same n
examples exactly in half will have a SplitInformation of 1. If attributes A and
B produce the same information gain, then clearly B will score higher
according to the gain ratio measure.
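A sketch of the gain ratio computation, building on the `info_gain` sketch given earlier (same assumed dictionary-based example representation):

```python
import math
from collections import Counter

def split_information(examples, attribute):
    """Entropy of S with respect to the values of attribute A."""
    total = len(examples)
    counts = Counter(ex[attribute] for ex in examples)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())

def gain_ratio(examples, attribute, target="PlayTennis"):
    """GainRatio(S, A) = Gain(S, A) / SplitInformation(S, A)."""
    split_info = split_information(examples, attribute)
    if split_info == 0:                 # attribute takes only a single value
        return 0.0
    return info_gain(examples, attribute, target) / split_info
```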
Handling Missing Attributes

In certain cases, the available data may be missing values for some
attributes. For example, in a medical domain in which we wish to
predict patient outcome based on various laboratory tests, it may be
that the lab test Blood-Test-Result is available only for a subset of
the patients. In such cases, it is common to estimate the missing
attribute value based on other examples for which this attribute has a
known value.
Handling Missing Attributes

One strategy for dealing with the missing attribute value is to assign
it the value that is most common among training examples at node n.

Alternatively, we might assign it the most common value among the examples
at node n that have the classification c(x).

A more complex procedure is to assign a probability to each of the


possible values of A rather than simply assigning the most common
value to A(x). These probabilities can be estimated again based on the
observed frequencies of the various values for A among the examples
at node n. This method for handling missing attribute values is used
in C4.5 (Quinlan 1993).
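A sketch of the simplest strategy (most common value among the examples at the node); the dictionary-based example representation is an assumption, and C4.5's probabilistic fractional-example method is more elaborate than this:

```python
from collections import Counter

def fill_missing(examples, attribute, missing=None):
    """Replace missing values of `attribute` with the most common observed value."""
    observed = [ex[attribute] for ex in examples
                if ex.get(attribute, missing) is not missing]
    if not observed:
        return examples                 # nothing to impute from
    most_common_value = Counter(observed).most_common(1)[0][0]
    return [
        {**ex, attribute: most_common_value}
        if ex.get(attribute, missing) is missing else ex
        for ex in examples
    ]
```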
Handling Attributes with Different
Cost

In some learning tasks the instance attributes may have associated


costs. For example, in learning to classify medical diseases we might
describe patients in terms of attributes such as Temperature,
BiopsyResult, Pulse, BloodTestResults, etc.

These attributes vary significantly in their costs, both in terms of


monetary cost and cost to patient comfort.

In such tasks, we would prefer decision trees that use low-cost


attributes where possible, relying on high-cost attributes only when
needed to produce reliable classifications.
Handling Attributes with Different
Cost

ID3 can be modified to take into account attribute costs by


introducing a cost term into the attribute selection measure. For
example, we might divide the Gain by the cost of the attribute, so that
lower-cost attributes would be preferred.

However, while such cost-sensitive measures do not guarantee finding an
optimal cost-sensitive decision tree, they do bias the search in favor of
low-cost attributes.

    Gain(S, A) / Cost(A)
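A sketch of attribute selection under this cost-adjusted measure, building on the earlier `info_gain` sketch; the `costs` mapping from attribute name to cost is an assumed input:

```python
def select_attribute_by_cost(examples, attributes, costs, target="PlayTennis"):
    """Choose the attribute maximising Gain(S, A) / Cost(A)."""
    return max(attributes,
               key=lambda a: info_gain(examples, a, target) / costs[a])
```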
Handling Attributes with Different
Cost
Tan and Schlimmer (1990) and Tan (1993) describe one such approach
and apply it to a robot perception task in which the robot must learn to
classify different objects according to how they can be grasped by the
robot's manipulator. In this case the attributes correspond to different
sensor readings obtained by a movable sonar on the robot.

Attribute cost is measured by the number of seconds required to obtain
the attribute value by positioning and operating the sonar. They
demonstrate that more efficient recognition strategies are learned,
without sacrificing classification accuracy, by replacing the information
gain attribute selection measure with the following measure:

    Gain^2(S, A) / Cost(A)
Handling Attributes with Different
Cost

Nunez (1988) describes a related approach and its application to
learning medical diagnosis rules. Here the attributes are different
symptoms and laboratory tests with differing costs. His system uses a
somewhat different attribute selection measure:

    (2^Gain(S, A) - 1) / (Cost(A) + 1)^w

where w (a constant between 0 and 1) determines the relative importance of
cost versus information gain.
