Machine Learning Lecture 3
Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University
Date: 12.3.2008
Today's Agenda
[Figure: anatomy of a decision tree. An internal node tests an attribute (here A1); each branch corresponds to one attribute value; a leaf node holds an output value.]
Decision Tree Representation
Decision trees represent a disjunction of conjunctions of constraints on the attribute values: each root-to-leaf path is a conjunction of attribute tests, and the tree as a whole is the disjunction of these conjunctions.
Decision Trees as If-Then-Else Rules
Each path yields one rule: IF (a conjunction of attribute tests) THEN (an output value); the full rule set covers the disjunction of all paths.
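As a concrete sketch in Python, the usual textbook PlayTennis tree (assumed here for illustration; this slide itself shows only generic attributes) reads directly as nested if-then-else rules:

```python
def play_tennis(outlook: str, humidity: str, wind: str) -> str:
    """Classify a day with the textbook PlayTennis decision tree.

    Each root-to-leaf path is one conjunction of attribute tests;
    the function as a whole is the disjunction of those paths.
    """
    if outlook == "Sunny":
        if humidity == "High":
            return "No"        # path: Sunny AND High
        return "Yes"           # path: Sunny AND Normal
    if outlook == "Overcast":
        return "Yes"           # path: Overcast (a leaf, no further tests)
    # outlook == "Rain": this subtree tests Wind
    return "No" if wind == "Strong" else "Yes"

print(play_tennis("Sunny", "High", "Weak"))  # -> No
```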
[Figure: a decision tree whose root tests attribute A1. Each branch carries one value of A1 and leads either to a leaf with an output value or to a subtree testing attribute A2 or A3.]
Which attribute should be selected as the root node: Outlook, Temperature, Humidity, or Wind?
Entropy
$$\mathrm{Entropy}(S) = -\sum_{i} p_i \log_2 p_i$$

Information Gain

$$\mathrm{Gain}(S, A) = \underbrace{\mathrm{Entropy}(S)}_{\text{entropy of } S} \; - \; \underbrace{\sum_{v \in \mathrm{Values}(A)} \frac{|S_v|}{|S|}\, \mathrm{Entropy}(S_v)}_{\text{entropy of } S \text{ after partition on } A}$$
Gain(S, A) is the information provided about the target function value, given the
value of some other attribute A. The value of Gain(S, A) is the number of bits
saved when encoding the target value of an arbitrary member of S, by knowing
the value of attribute A.
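Both formulas are easy to state in code. A minimal sketch (the dict-based example representation and the "target" key are assumptions for illustration):

```python
from collections import Counter
from math import log2

def entropy(labels):
    """Entropy(S) = -sum_i p_i log2 p_i over the target values in S."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def information_gain(examples, attribute):
    """Gain(S, A): entropy of S minus the weighted entropy of the
    subsets S_v obtained by partitioning S on attribute A."""
    labels = [ex["target"] for ex in examples]
    gain = entropy(labels)
    for v in {ex[attribute] for ex in examples}:
        subset = [ex["target"] for ex in examples if ex[attribute] == v]
        gain -= (len(subset) / len(examples)) * entropy(subset)
    return gain

data = [{"wind": "Weak", "target": "Yes"}, {"wind": "Strong", "target": "No"},
        {"wind": "Weak", "target": "Yes"}, {"wind": "Strong", "target": "Yes"}]
print(information_gain(data, "wind"))  # ~0.311 bits on this toy data
```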
Example
ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. This contrasts with methods that make decisions
incrementally, based on individual training examples (e.g., FIND-S or
CANDIDATE-ELIMINATION). One advantage of using statistical
properties of all the examples (e.g., information gain) is that the
resulting search is much less sensitive to errors in individual training
examples. [Advantage]
Machine Learning Biases
The error of a candidate hypothesis $F'$ over the training examples $\langle b, F_{\mathrm{train}}(b) \rangle$:

$$E(\mathrm{Error}) = \sum_{\langle b,\, F_{\mathrm{train}}(b) \rangle \,\in\, \text{training examples}} \big( F_{\mathrm{train}}(b) - F'(b) \big)^2$$
This is often paraphrased as "All other things being equal, the simplest
solution is the best."
Why should the simplest hypothesis that fits the data be the best solution?
Why not the second simplest or the third simplest hypothesis?
The error rate is just the proportion of errors made over a whole
set of instances, and it measures the overall performance of the
classifier.
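As a trivial sketch (the label lists below are hypothetical):

```python
def error_rate(predicted, actual):
    """Proportion of instances the classifier got wrong."""
    wrong = sum(p != a for p, a in zip(predicted, actual))
    return wrong / len(actual)

print(error_rate(["Yes", "No", "Yes"], ["Yes", "Yes", "Yes"]))  # 1/3 ~ 0.333
```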
Training and Testing
So the question is, is the error rate on old data likely to be a good
indicator of the error rate on new data?
The answer is a resounding no, not if the old data was used
during the learning process to train the classifier.
Training and Testing
The error rate on the training data is called the resubstitution error,
because it is calculated by resubstituting the training instances into a
classifier that was constructed from them.
Training and Testing
Holdout Strategy: The holdout method reserves a certain amount of the data for
testing and uses the remainder for training (setting part of that aside
for validation, if required).
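A minimal sketch of the holdout split (the 25%/10% fractions are illustrative assumptions, not from the slide):

```python
import random

def holdout_split(examples, test_frac=0.25, val_frac=0.10, seed=0):
    """Shuffle, reserve test_frac of the data for testing and val_frac
    for validation; the remainder is used for training."""
    data = list(examples)
    random.Random(seed).shuffle(data)
    n_test = int(len(data) * test_frac)
    n_val = int(len(data) * val_frac)
    test = data[:n_test]
    val = data[n_test:n_test + n_val]
    train = data[n_test + n_val:]
    return train, val, test

train, val, test = holdout_split(range(100))
print(len(train), len(val), len(test))  # 65 10 25
```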
4-Fold Cross-Validation

[Figure: the dataset is split into four parts. In each of four runs, one part serves as the test dataset and the remaining three form the training dataset, yielding accuracies ACC1, ACC2, ACC3, and ACC4; the final estimate is their average.]
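A minimal sketch of the 4-fold loop above. The training and accuracy stand-ins below (a majority-class "model") are hypothetical placeholders, not part of the slides:

```python
def k_fold_cv(examples, k, train_fn, acc_fn):
    """Each of the k parts serves once as the test dataset while the
    other k-1 parts form the training dataset; the k accuracies
    (ACC1..ACCk) are averaged into a single estimate."""
    folds = [examples[i::k] for i in range(k)]
    accuracies = []
    for i in range(k):
        test = folds[i]
        training = [ex for j, fold in enumerate(folds) if j != i for ex in fold]
        model = train_fn(training)
        accuracies.append(acc_fn(model, test))
    return sum(accuracies) / k

# Hypothetical stand-ins so the sketch runs: the "model" is simply the
# majority label seen during training.
def train_majority(examples):
    labels = [y for _, y in examples]
    return max(set(labels), key=labels.count)

def majority_accuracy(model, examples):
    return sum(y == model for _, y in examples) / len(examples)

data = [(x, "Yes" if x % 3 else "No") for x in range(20)]
print(k_fold_cv(data, 4, train_majority, majority_accuracy))
```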
Overfitting
Given a hypothesis space H, a hypothesis h ∈ H is said to overfit the training data if there exists some alternative hypothesis h' ∈ H such that h has smaller error than h' over the training examples, but h' has smaller error than h over the entire distribution of instances.
[Figure: a training set of positive and negative examples in instance space.]
Overfitting
[Figure: two hypotheses, h1 and h2, drawn as decision boundaries over the same examples.]
Overfitting
[Figure: h1 is more accurate than h2 on the training examples.]
Overfitting
[Figure: h1 is less accurate than h2 on the unseen (test) examples.]
Overfitting
[Figure: recap quiz. Is h1 more accurate than h2 on the training examples? Yes. Is it more accurate on the unseen (test) examples? No. This combination of answers is the signature of overfitting.]
Overfitting in decision tree learning. As ID3 adds new nodes to grow the decision tree, the
accuracy of the tree measured over the training examples increases monotonically. However,
when measured over a set of test examples independent of the training examples, accuracy
first increases, then decreases.
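The rise-then-fall of test accuracy is easy to reproduce with any learner whose tree size can be capped. A sketch using scikit-learn (an assumption; the slides name no library), with max_depth standing in for the number of nodes:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data with label noise (flip_y), so deep trees can overfit.
X, y = make_classification(n_samples=400, n_features=20, flip_y=0.2,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in range(1, 15):
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_tr, y_tr)
    # Training accuracy typically climbs toward 1.0 as depth grows;
    # test accuracy usually peaks at a moderate depth, then falls.
    print(depth, tree.score(X_tr, y_tr), tree.score(X_te, y_te))
```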
Why Does Overfitting Happen in Decision Tree Learning?
As the tree grows, it becomes more complex and its depth increases; in the presence of errors (noise) in the training examples, the extra structure fits that noise, and over-fitting results.
How to Avoid Overfitting

Rule Post-Pruning:
1. Grow the decision tree from the training data, allowing overfitting to occur.
2. Convert the learned tree into an equivalent set of rules, one rule per root-to-leaf path, e.g.:
   Rule 1: IF (Outlook = sunny and Temperature = Hot) THEN PlayTennis = No
3. Prune each rule by removing any preconditions whose removal improves its estimated accuracy.
4. Sort the pruned rules by their estimated accuracy, and apply them in this sequence when classifying new instances.
[Figure: rules R1 to R14 with estimated accuracies Acc1 to Acc14 are sorted in descending order of their accuracy on the test dataset or validation examples, producing the sorted sequence S1 to S14:]

S1: Acc1 >= S2: Acc2 >= S3: Acc3 >= S4: Acc4 >= ... >= S11: Acc11 >= S12: Acc12 >= S13: Acc13 >= S14: Acc14
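A minimal sketch of steps 3 and 4 above (the precondition-list rule representation and the validation examples are assumptions for illustration):

```python
def rule_accuracy(preconditions, conclusion, validation):
    """Estimated accuracy of one rule over the validation examples it matches."""
    matched = [ex for ex in validation
               if all(ex[attr] == val for attr, val in preconditions)]
    if not matched:
        return 0.0
    return sum(ex["target"] == conclusion for ex in matched) / len(matched)

def prune_rule(preconditions, conclusion, validation):
    """Step 3: greedily drop any precondition whose removal improves
    the rule's estimated accuracy."""
    preconditions = list(preconditions)
    improved = True
    while improved:
        improved = False
        current = rule_accuracy(preconditions, conclusion, validation)
        for p in list(preconditions):
            trimmed = [q for q in preconditions if q != p]
            if rule_accuracy(trimmed, conclusion, validation) > current:
                preconditions = trimmed
                improved = True
                break
    return preconditions

validation = [
    {"Outlook": "Sunny", "Temperature": "Hot",  "target": "No"},
    {"Outlook": "Sunny", "Temperature": "Hot",  "target": "Yes"},
    {"Outlook": "Sunny", "Temperature": "Mild", "target": "No"},
    {"Outlook": "Rain",  "Temperature": "Hot",  "target": "Yes"},
]
rule = [("Outlook", "Sunny"), ("Temperature", "Hot")]
print(prune_rule(rule, "No", validation))  # -> [("Outlook", "Sunny")]

# Step 4: sort the pruned rules by estimated accuracy, best first, e.g.
# sorted(rules, key=lambda r: rule_accuracy(r[0], r[1], validation),
#        reverse=True)
```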
Handling Continuous-Valued Attributes
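The standard remedy, described by Mitchell for ID3, is to dynamically define a boolean attribute of the form A < c, trying candidate thresholds c midway between adjacent sorted values at which the classification changes and keeping the threshold with the highest information gain. A minimal, self-contained sketch (the temperature data mirrors Mitchell's example):

```python
from collections import Counter
from math import log2

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

def best_threshold(pairs):
    """pairs: (numeric attribute value, target label) tuples.
    Returns (threshold c, gain) for the best boolean test 'value < c'."""
    pairs = sorted(pairs)
    labels = [y for _, y in pairs]
    base = entropy(labels)
    best = (None, -1.0)
    for (x1, y1), (x2, y2) in zip(pairs, pairs[1:]):
        if y1 == y2 or x1 == x2:
            continue                  # only class boundaries can help
        c = (x1 + x2) / 2             # candidate: midpoint of neighbors
        below = [y for x, y in pairs if x < c]
        above = [y for x, y in pairs if x >= c]
        gain = base - (len(below) / len(pairs)) * entropy(below) \
                    - (len(above) / len(pairs)) * entropy(above)
        if gain > best[1]:
            best = (c, gain)
    return best

# Mitchell's Temperature example: candidates (48+60)/2 and (80+90)/2.
temps = [(40, "No"), (48, "No"), (60, "Yes"),
         (72, "Yes"), (80, "Yes"), (90, "No")]
print(best_threshold(temps))  # -> (54.0, ...)
```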
Handling Attributes with Many Values

Consider the attribute Date, which has a very large number of possible
values (e.g., March 11, 2008). Such an attribute splits the training examples into many very small subsets, so it scores a very high information gain while being a poor predictor on unseen instances; measures such as the gain ratio penalize attributes of this kind.

Handling Missing Attributes
In certain cases, the available data may be missing values for some
attributes. For example, in a medical domain in which we wish to
predict patient outcome based on various laboratory tests, it may be
that the lab test Blood-Test-Result is available only for a subset of
the patients. In such cases, it is common to estimate the missing
attribute value based on other examples for which this attribute has a
known value.
One strategy for dealing with the missing attribute value is to assign
it the value that is most common among training examples at node n.
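A minimal sketch of that strategy (the node's example set and the use of None for a missing value are hypothetical):

```python
from collections import Counter

def fill_missing(examples, attribute):
    """Assign a missing attribute value (None) the value that is most
    common among the training examples at this node."""
    known = [ex[attribute] for ex in examples if ex[attribute] is not None]
    most_common = Counter(known).most_common(1)[0][0]
    for ex in examples:
        if ex[attribute] is None:
            ex[attribute] = most_common
    return examples

node_examples = [{"Blood-Test-Result": "high"}, {"Blood-Test-Result": "high"},
                 {"Blood-Test-Result": "low"}, {"Blood-Test-Result": None}]
print(fill_missing(node_examples, "Blood-Test-Result"))  # None becomes "high"
```

Mitchell also describes refinements: use the most common value only among examples at the node that share the same target classification, or (as in C4.5) split the example into fractional examples, one per possible value, weighted by the observed value frequencies.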
Handling Attributes with Different Costs

One approach is to prefer attributes that provide high information gain relative to their measurement cost, replacing the gain criterion with the ratio

$$\frac{\mathrm{Gain}(S, A)}{\mathrm{Cost}(A)}$$
Tan and Schlimmer (1990) and Tan (1993) describe one such approach
and apply it to a robot perception task in which the robot must learn to
classify different objects according to how they can be grasped by the
robot's manipulator. In this case the attributes correspond to different
sensor readings obtained by a movable sonar on the robot.
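A minimal sketch of selecting attributes by the Gain(S, A) / Cost(A) criterion above (the sensor attributes, gain values, and costs below are hypothetical):

```python
def pick_attribute(gains, costs):
    """Choose the attribute maximizing Gain(S, A) / Cost(A), so cheap,
    informative measurements are preferred near the root."""
    return max(gains, key=lambda a: gains[a] / costs[a])

# Hypothetical sonar/camera readings with information gains and costs.
gains = {"sonar_width": 0.40, "sonar_depth": 0.35, "camera_shape": 0.45}
costs = {"sonar_width": 1.0, "sonar_depth": 0.5, "camera_shape": 5.0}
print(pick_attribute(gains, costs))  # -> sonar_depth (0.70 gain per unit cost)
```

Mitchell notes that Tan and Schlimmer's actual measure squares the numerator, Gain²(S, A) / Cost(A), weighting informativeness more heavily than cost.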