
Lecture No.

Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University

Date: 8.3.2008
Today's Agenda

Recap (FIND-S Algorithm)


Version Space
Candidate-Elimination Algorithm
Decision Tree
ID3 Algorithm
Entropy
Concept Learning as Search

Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.

The goal of the concept learning search is to find the hypothesis


that best fits the training examples.
General-to-Specific Learning
Example task: learn the days on which Tom enjoys his sport, given only positive examples.

Most General Hypothesis: h = <?, ?, ?, ?, ?, ?>

Most Specific Hypothesis: h = <∅, ∅, ∅, ∅, ∅, ∅>


General-to-Specific Learning

h2 is more general than h1

h2 imposes fewer constraints on the instance than h1


Definition

Given hypotheses hj and hk, hj is more_general_than_or_equal_to


hk if and only if any instance that satisfies hk also satisfies hj.

We can also say that hj is more_specific_than hk when hk is


more_general_than hj.
FIND-S: Finding a Maximally
Specific Hypothesis
Step 1: FIND-S

h0 = <∅, ∅, ∅, ∅, ∅, ∅>
Step 2: FIND-S

h0 = <∅, ∅, ∅, ∅, ∅, ∅>

a1 a2 a3 a4 a5 a6

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 1

h1 = <Sunny, Warm, Normal, Strong, Warm, Same>


h1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 2
x2 = <Sunny, Warm, High, Strong, Warm, Same>

h2 = <Sunny, Warm, ?, Strong, Warm, Same>


Iteration 3
x3 = <Rainy, Cold, High, Strong, Warm, Change> is a negative example, so FIND-S ignores it.
h3 = <Sunny, Warm, ?, Strong, Warm, Same>

Iteration 4
x4 = < Sunny, Warm, High, Strong, Cool, Change >

Step 3

Output h4 = <Sunny, Warm, ?, Strong, ?, ?>
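The whole procedure fits in a few lines of Python; the following is a minimal sketch, assuming examples are given as attribute tuples with Yes/No labels (the ∅ constraint is written as the string '0', and all names are illustrative, not part of the lecture):

def generalize(h, x):
    # minimally generalize hypothesis h so that it covers the positive example x
    return tuple(hi if hi == xi else ('?' if hi != '0' else xi)
                 for hi, xi in zip(h, x))

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs, label is 'Yes' or 'No'
    h = ('0',) * len(examples[0][0])          # h0: the most specific hypothesis
    for x, label in examples:
        if label == 'Yes':                    # negative examples are ignored
            h = generalize(h, x)
    return h

training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(training))    # ('Sunny', 'Warm', '?', 'Strong', '?', '?')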


Unanswered Questions by FIND-S

Has the learner converged to the correct target


concept?

Why prefer the most specific hypothesis?

What if the training examples are inconsistent (contain errors)?


Version Space

The set of all hypotheses consistent with the training examples is called the version space (VS) with respect to the hypothesis space H and the given example set D.
Candidate-Elimination Algorithm

The Candidate-Elimination algorithm finds all describable hypotheses that are consistent with the observed training examples.

The hypotheses are derived from the examples regardless of whether each example x is positive or negative.
Candidate-Elimination Algorithm

[Slide: definition used by the algorithm, contrasted with the earlier (i.e., FIND-S) definition]
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space
[Diagram: the examples D and the hypothesis space H determine the version space VS_H,D, shown as the subset of H consistent with D]
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space

In principle, the LIST-THEN-ELIMINATE algorithm can be


applied whenever the hypothesis space H is finite.

It is guaranteed to output all hypotheses consistent with the


training data.

Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
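A minimal sketch of LIST-THEN-ELIMINATE in Python, assuming the finite hypothesis space is handed over explicitly as a list of conjunctive hypotheses (names are illustrative):

def consistent(h, example):
    # a conjunctive hypothesis h classifies x positive iff every constraint matches
    x, label = example
    predicted = all(c == '?' or c == v for c, v in zip(h, x))
    return predicted == (label == 'Yes')

def list_then_eliminate(hypothesis_space, examples):
    version_space = list(hypothesis_space)        # 1. start with every hypothesis in H
    for example in examples:                      # 2. eliminate inconsistent hypotheses
        version_space = [h for h in version_space if consistent(h, example)]
    return version_space                          # 3. output the version space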
Candidate-Elimination Algorithm

The CANDIDATE-ELIMINATION algorithm works on the same


principle as the above LIST-THEN-ELIMINATE algorithm.

It employs a much more compact representation of the version


space.

Here the version space is represented by its most general and its least general (most specific) members.

These members form general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
[Diagram: the version space delimited by the least general (specific) boundary S and the most general boundary G]
Candidate-Elimination Algorithm
Example

G0 {<?, ?, ?, ?, ?, ?>}

Initialization

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}
G0 {<?, ?, ?, ?, ?, ?>}

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>


Iteration 1
G1 {<?, ?, ?, ?, ?, ?>}

S1 {< Sunny, Warm, Normal, Strong, Warm, Same >}

x2 = <Sunny, Warm, High, Strong, Warm, Same>


Iteration 2
G2 {<?, ?, ?, ?, ?, ?>}

S2 {< Sunny, Warm, ?, Strong, Warm, Same >}


G2 {<?, ?, ?, ?, ?, ?>}

S2 {< Sunny, Warm, ?, Strong, Warm, Same >}

consistent

x3 = <Rainy, Cold, High, Strong, Warm, Change>


Iteration 3
S3 {< Sunny, Warm, ?, Strong, Warm, Same >}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}

G2 {<?, ?, ?, ?, ?, ?>}
S3 {< Sunny, Warm, ?, Strong, Warm, Same >}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}

x4 = <Sunny, Warm, High, Strong, Cool, Change>


Iteration 4
S4 {< Sunny, Warm, ?, Strong, ?, ? >}

G4 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
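The S/G trace above can be reproduced with a compact Python sketch of the boundary-set updates for conjunctive hypotheses; this is an illustrative simplification (it does not, for instance, prune non-minimal members of S), the attribute domains are restricted to the values appearing in the examples, and all names are my own:

def matches(h, x):
    # a conjunctive hypothesis h covers instance x iff every constraint matches
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    # h1 is more general than or equal to h2 ('0' is the empty constraint)
    return all(a == '?' or b == '0' or a == b for a, b in zip(h1, h2))

def generalize(s, x):
    # unique minimal generalization of s that covers the positive example x
    return tuple(si if si == xi else ('?' if si != '0' else xi)
                 for si, xi in zip(s, x))

def specializations(g, domains, x):
    # minimal specializations of g that exclude the negative example x
    out = []
    for i, c in enumerate(g):
        if c == '?':
            out += [g[:i] + (v,) + g[i + 1:] for v in domains[i] if v != x[i]]
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {('0',) * n}, {('?',) * n}
    for x, label in examples:
        if label == 'Yes':
            G = {g for g in G if matches(g, x)}
            S = {generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:
                    new_G |= {h for h in specializations(g, domains, x)
                              if any(more_general(h, s) for s in S)}
            # keep only the maximally general members of G
            G = {g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)}
    return S, G

domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(candidate_elimination(training, domains))
# S4 = {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
# G4 = {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}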


Remarks on Version Spaces and
Candidate-Elimination

The version space learned by the CANDIDATE-ELIMINATION algorithm


will converge toward the hypothesis that correctly describes the target
concept, provided

(1) there are no errors in the training examples, and

(2) there is some hypothesis in H that correctly


describes the target concept.
What Will Happen if the Training Data Contains Errors?

Suppose the second example, x2, is mistakenly labeled No.
G0 {<?, ?, ?, ?, ?, ?>}

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>


Iteration 1
G1 {<?, ?, ?, ?, ?, ?>}

S1 {< Sunny, Warm, Normal, Strong, Warm, Same >}

x2 = <Sunny, Warm, High, Strong, Warm, Same>


Iteration 2
G2 {<?, ?, Normal, ?, ?, ?>}

S2 {< Sunny, Warm, Normal, Strong, Warm, Same >}


G2 {<?, ?, Normal, ?, ?, ?>}

S2 {< Sunny, Warm, Normal, Strong, Warm, Same >}

consistent

x3 = <Rainy, Cold, High, Strong, Warm, Change>


Iteration 3
S3 {< Sunny, Warm, Normal, Strong, Warm, Same >}

G3 {<?, ?, Normal, ?, ?, ?>}


S3 {< Sunny, Warm, Normal, Strong, Warm, Same >}

G3 {<?, ?, Normal, ?, ?, ?>}

x4 = <Sunny, Warm, High, Strong, Cool, Change>


Iteration 4
S4 { }  (empty)
G4 { }  (empty)

G3 {<?, ?, Normal, ?, ?, ?>}

With the erroneous example, S and G collapse to empty sets: the correct target concept has been eliminated from the version space.


What Will Happen if the Target Concept Is Not Present in the Hypothesis Space?
Remarks on Version Spaces and
Candidate-Elimination

The target concept is exactly learned when


the S and G boundary sets converge to a
single, identical, hypothesis.
Remarks on Version Spaces and
Candidate-Elimination

How Can Partially Learned Concepts Be Used?


Suppose that no additional training examples are available beyond the four in our example, and the learner is now required to classify new instances that it has not yet observed.
Remarks on Version Spaces and
Candidate-Elimination

If a new instance satisfies all six hypotheses in the version space, it can be classified Yes with full confidence.

If a new instance satisfies three of the hypotheses and fails the other three, it cannot be classified reliably; if it satisfies two and fails four, at best a majority-vote classification (No) can be given.
Decision Trees

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

Decision trees can also be represented as if-then-else rules.

Decision tree learning is one of the most widely used approaches to inductive inference.
Decision Trees

An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example. This process
is then repeated for the subtree rooted at the new node.
Decision Trees

<Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>

PlayTennis = No
Decision Trees
Intermediate nodes test attributes, edges carry attribute values, and leaf nodes hold output values.

[Diagram: a generic tree; the root tests attribute A1, whose value-labelled edges lead either to an output value or to further attribute tests A2 and A3, whose edges in turn end in output values]
Decision Trees

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances: each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
Decision Trees
Decision Trees (F = A ^ B')
If-then-else form:
If (A = True and B = False) then Yes
else No

[Tree: the root tests A; A = False gives No; A = True leads to a test of B, where B = False gives Yes and B = True gives No]
Decision Trees (F = A V (B ^ C))

If-then-else form:
If (A = True) then Yes
else if (B = True and C = True) then Yes
else No

[Tree: the root tests A; A = True gives Yes; A = False leads to a test of B; B = False gives No; B = True leads to a test of C, where C = False gives No and C = True gives Yes]
Decision Trees (F = A XOR B)
F = (A ^ B') V (A' ^ B)

If-then-else form:
If (A = True and B = False) then Yes
else if (A = False and B = True) then Yes
else No

[Tree: the root tests A; each branch then tests B; the leaves are No (A = False, B = False), Yes (A = False, B = True), Yes (A = True, B = False), and No (A = True, B = True)]
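The same XOR tree, written out directly in if-then-else code (a small illustrative sketch; the function name is mine):

def xor_tree(a, b):
    # decision tree for F = (A ^ B') V (A' ^ B)
    if a:                                   # root node tests A
        return "No" if b else "Yes"         # right subtree tests B
    else:
        return "Yes" if b else "No"         # left subtree tests B

# truth-table check
for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))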
Decision Trees as If-then-else Rules

Each rule body is a conjunction of attribute tests, and the rule set as a whole is a disjunction:

If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes
If (Outlook = Overcast) then PlayTennis = Yes
If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
Problems Suitable for Decision Trees

Instances are represented by attribute-value pairs


Instances are described by a fixed set of attributes (e.g., Temperature) and
their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot,
Mild, Cold). However, extensions to the basic algorithm allow handling real-
valued attributes as well (e.g., representing Temperature numerically).

The target function has discrete output values

Disjunctive descriptions may be required

The training data may contain errors

The training data may contain missing attribute values


Basic Decision Tree Learning Algorithm

ID3 Algorithm (Quinlan 1986) and its successors C4.5 and C5.0 (http://www.rulequest.com/Personal/).

Employs a top-down, greedy search through the space of possible decision trees. The algorithm never backtracks to reconsider earlier choices.
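A compact sketch of the ID3 recursion in Python, assuming training examples are given as dictionaries; it uses the entropy and information-gain measures introduced later in the lecture, the names are illustrative, and the refinements added by C4.5/C5.0 are omitted:

import math
from collections import Counter

def entropy(labels):
    # entropy (in bits) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute, target):
    # expected reduction in entropy from partitioning examples on attribute
    before = entropy([e[target] for e in examples])
    after = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        after += len(subset) / len(examples) * entropy([e[target] for e in subset])
    return before - after

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                    # pure node: return the single label
        return labels[0]
    if not attributes:                           # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:        # one branch per observed value
        subset = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, remaining, target)
    return tree

# e.g. id3(data, ['Outlook', 'Temperature', 'Humidity', 'Wind'], 'PlayTennis')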
ID3 Algorithm
Example
Attributes

Attributes are Outlook, Temperature, Humidity, Wind


Building Decision Tree
[Diagram: a generic decision tree; the root tests attribute A1, its value-labelled branches lead either to an output value or to further attribute tests A2 and A3, whose branches end in output values]
Building Decision Tree

Which attribute (Outlook, Temperature, Humidity, or Wind) should be selected for the root node?
Which Attribute to Select?

We would like to select the attribute that is most useful for


classifying examples.

What is a good quantitative measure of the worth of an


attribute?

ID3 uses a measure called information gain to select among the candidate attributes at each step while growing the tree.
Information Gain

Information gain is based on an information-theory concept called entropy.

"Nothing in life is certain except death, taxes and the second law of thermodynamics. All three are processes in which useful or accessible forms of some quantity, such as energy or money, are transformed into useless, inaccessible forms of the same quantity. That is not to say that these three processes don't have fringe benefits: taxes pay for roads and schools; the second law of thermodynamics drives cars, computers and metabolism; and death, at the very least, opens up tenured faculty positions."
(Seth Lloyd, writing in Nature 430, 971, 26 August 2004)

Rudolf Julius Emanuel Clausius (January 2, 1822 - August 24, 1888) was a German physicist and mathematician and is considered one of the central founders of the science of thermodynamics.

Claude Elwood Shannon (April 30, 1916 - February 24, 2001), an American electrical engineer and mathematician, has been called "the father of information theory".
Entropy

In information theory, the Shannon entropy or


information entropy is a measure of the uncertainty
associated with a random variable.

It quantifies the information contained in a


message, usually in bits or bits/symbol.

It is the minimum message length necessary to


communicate information.
Why Shannon named his uncertainty
function "entropy"?

John von
Neumann

My greatest concern was what to call it. I thought of calling it 'information,' but the
word was overly used, so I decided to call it 'uncertainty.' When I discussed it with
John von Neumann, he had a better idea. Von Neumann told me, 'You should call
it entropy, for two reasons. In the first place your uncertainty function has
been used in statistical mechanics under that name, so it already has a name.
In the second place, and more important, no one really knows what entropy
really is, so in a debate you will always have the advantage.'
Shannon's mouse

Shannon and his famous electromechanical mouse Theseus, named after the Greek hero of Minotaur and Labyrinth fame, which he tried to teach to find its way out of a maze in one of the first experiments in artificial intelligence.
Entropy

The information entropy of a discrete random variable X that can take on possible values {x1, ..., xn} is

H(X) = E[I(X)] = - Σi p(xi) log2 p(xi)

where
I(X) is the information content or self-information of X, which is itself a random variable; and
p(xi) = Pr(X = xi) is the probability mass function of X.
Entropy in our Context

Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification (yes/no) is

Entropy(S) = - p+ log2 p+ - p- log2 p-

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
Example

There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
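This value can be checked with a small Python helper (an illustrative sketch; the function name is mine):

import math

def entropy(pos, neg):
    # entropy of a boolean-labelled sample with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # 0 log 0 is taken to be 0
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))            # ~0.940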


Information Gain Measure

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to a given attribute.

More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
Information Gain Measure

In the formula above, the first term is the entropy of S and the second term is the expected entropy after S is partitioned using attribute A.

Gain(S, A) is the expected reduction in entropy caused by knowing the value of


attribute A.

Gain(S, A) is the information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
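This definition can be sketched directly in Python, assuming the per-value positive/negative counts of S are given as a dictionary (the function names are illustrative, not part of the lecture):

import math

def entropy2(pos, neg):
    # entropy of a collection with pos positive and neg negative examples (0 log 0 = 0)
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(counts):
    # counts maps each value v of attribute A to the (positive, negative) counts of Sv
    pos = sum(p for p, n in counts.values())
    neg = sum(n for p, n in counts.values())
    total = pos + neg
    expected = sum((p + n) / total * entropy2(p, n) for p, n in counts.values())
    return entropy2(pos, neg) - expected

# e.g. a split whose subsets contain {1+, 2-} and {1+, 1-} examples
print(information_gain({'Weak': (1, 2), 'Strong': (1, 1)}))   # ~0.020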
Example

There are 14 examples: 9 positive and 5 negative [9+, 5-], so Entropy(S) = 0.940, as computed above.


Gain (S, Attribute = Wind)
Gain (SSunny, A)

Temperature: (Hot) {0+, 2-}, (Mild) {1+, 1-}, (Cool) {1+, 0-}
Humidity: (High) {0+, 3-}, (Normal) {2+, 0-}
Wind: (Weak) {1+, 2-}, (Strong) {1+, 1-}
Gain (SSunny, A)

Entropy(SSunny) = - { 2/5 log(2/5) + 3/5 log(3/5) } = 0.97095

Temperature: (Hot) {0+, 2-}, (Mild) {1+, 1-}, (Cool) {1+, 0-}
Entropy(Hot) = 0
Entropy(Mild) = 1
Entropy(Cool) = 0
Gain(SSunny, Temperature) = 0.97095 - 2/5*0 - 2/5*1 - 1/5*0 = 0.57095

Humidity: (High) {0+, 3-}, (Normal) {2+, 0-}
Entropy(High) = 0
Entropy(Normal) = 0
Gain(SSunny, Humidity) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095

Wind: (Weak) {1+, 2-}, (Strong) {1+, 1-}
Entropy(Weak) = 0.9183
Entropy(Strong) = 1.0
Gain(SSunny, Wind) = 0.97095 - 3/5*0.9183 - 2/5*1 = 0.01997
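A quick numeric check of these three gains from the counts listed above (a minimal, self-contained sketch; H is an illustrative helper):

import math

def H(p, n):
    # entropy of a {p+, n-} collection (0 log 0 taken as 0)
    t = p + n
    return -sum(c / t * math.log2(c / t) for c in (p, n) if c)

s = H(2, 3)                                          # Entropy(SSunny) = 0.97095
print(s - 2/5*H(0, 2) - 2/5*H(1, 1) - 1/5*H(1, 0))   # Temperature: ~0.571
print(s - 3/5*H(0, 3) - 2/5*H(2, 0))                 # Humidity:    ~0.971
print(s - 3/5*H(1, 2) - 2/5*H(1, 1))                 # Wind:        ~0.020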
Modified Decision Tree
Gain (SRain, A)

Temperature: (Hot) {0+, 0-}, (Mild) {2+, 1-}, (Cool) {1+, 1-}
Humidity: (High) {1+, 1-}, (Normal) {2+, 1-}
Wind: (Weak) {3+, 0-}, (Strong) {0+, 2-}
Gain (SRain, A)

Entropy(SRain) = - { 3/5 log(3/5) + 2/5 log(2/5) } = 0.97095

Temperature: (Hot) {0+, 0-}, (Mild) {2+, 1-}, (Cool) {1+, 1-}
Entropy(Hot) = 0
Entropy(Mild) = 0.9183
Entropy(Cool) = 1.0
Gain(SRain, Temperature) = 0.97095 - 0 - 3/5*0.9183 - 2/5*1.0 = 0.01997

Humidity: (High) {1+, 1-}, (Normal) {2+, 1-}
Entropy(High) = 1.0
Entropy(Normal) = 0.9183
Gain(SRain, Humidity) = 0.97095 - 2/5*1.0 - 3/5*0.9183 = 0.01997

Wind: (Weak) {3+, 0-}, (Strong) {0+, 2-}
Entropy(Weak) = 0.0
Entropy(Strong) = 0.0
Gain(SRain, Wind) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095
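The corresponding check for SRain, again as a self-contained sketch (the helper returns 0 for the empty Hot partition):

import math

def H(p, n):
    # entropy of a {p+, n-} collection; an empty partition contributes 0
    t = p + n
    return -sum(c / t * math.log2(c / t) for c in (p, n) if c) if t else 0.0

s = H(3, 2)                                          # Entropy(SRain) = 0.97095
print(s - 0/5*H(0, 0) - 3/5*H(2, 1) - 2/5*H(1, 1))   # Temperature: ~0.020
print(s - 2/5*H(1, 1) - 3/5*H(2, 1))                 # Humidity:    ~0.020
print(s - 3/5*H(3, 0) - 2/5*H(0, 2))                 # Wind:        ~0.971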
Final Decision Tree
Homework
S {3+, 3-} => Entropy(S) = 1

a1: (True) {2+, 1-}, (False) {1+, 2-}
Entropy(a1=True) = -{ 2/3 log(2/3) + 1/3 log(1/3) } = 0.9183
Entropy(a1=False) = 0.9183
Gain(S, a1) = 1 - 3/6*0.9183 - 3/6*0.9183 = 0.0817

a2: (True) {2+, 2-}, (False) {1+, 1-}
Entropy(a2=True) = 1.0
Entropy(a2=False) = 1.0
Gain(S, a2) = 1 - 4/6*1 - 2/6*1 = 0.0
Homework

[Tree after the first split: the root tests a1; a1 = True holds examples [D1, D2, D3] and a1 = False holds [D4, D5, D6]]

Homework

[Tree after the second split: each a1 branch tests a2; for a1 = True, a2 = True gives + (Yes) and a2 = False gives - (No); for a1 = False, a2 = True gives - (No) and a2 = False gives + (Yes)]
Homework

[Final tree, as above; the learned concept is:]

(a1^a2) V (a1' ^ a2')


Some Insights into Capabilities and
Limitations of ID3 Algorithm
ID3 searches a complete hypothesis space: any finite discrete-valued function can be represented by some decision tree. [Advantage]

ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By committing to a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. [Disadvantage]

ID3 in its pure form performs no backtracking in its search. Once it


selects an attribute to test at a particular level in the tree, it never
backtracks to reconsider this choice. Therefore, it is susceptible to
the usual risks of hill-climbing search without backtracking:
converging to locally optimal solutions that are not globally optimal.
[Disadvantage]
Some Insights into Capabilities and
Limitations of ID3 Algorithm

ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. This contrasts with methods that make decisions
incrementally, based on individual training examples (e.g., FIND-S
or CANDIDATE-ELIMINATION). One advantage of using statistical
properties of all the examples (e.g., information gain) is that the
resulting search is much less sensitive to errors in individual training
examples. [Advantage]
