
Lecture No.

Ravi Gupta
AU-KBC Research Centre,
MIT Campus, Anna University

Date: 8.3.2008
Today's Agenda

Recap (FIND-S Algorithm)


Version Space
Candidate-Elimination Algorithm
Decision Tree
ID3 Algorithm
Entropy
Concept Learning as Search

Concept learning can be viewed as the task of searching through a large space of hypotheses implicitly defined by the hypothesis representation.

The goal of the concept learning search is to find the hypothesis


that best fits the training examples.
General-to-Specific Learning
Example task: learn the days on which Tom enjoys his sport, given only positive examples.

Most General Hypothesis: h = <?, ?, ?, ?, ?, ?>

Most Specific Hypothesis: h = <∅, ∅, ∅, ∅, ∅, ∅>


General-to-Specific Learning

h2 is more general than h1

h2 imposes fewer constraints on the instance than h1


Definition

Given hypotheses hj and hk, hj is more_general_than_or_equal_to


hk if and only if any instance that satisfies hk also satisfies hj.

We can also say that hj is more_specific_than hk when hk is


more_general_than hj.
FIND-S: Finding a Maximally
Specific Hypothesis
Step 1: FIND-S

h0 = <∅, ∅, ∅, ∅, ∅, ∅>
Step 2: FIND-S

h0 = <∅, ∅, ∅, ∅, ∅, ∅>

a1 a2 a3 a4 a5 a6

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 1

h1 = <Sunny, Warm, Normal, Strong, Warm, Same>


h1 = <Sunny, Warm, Normal, Strong, Warm, Same>

Iteration 2
x2 = <Sunny, Warm, High, Strong, Warm, Same>

h2 = <Sunny, Warm, ?, Strong, Warm, Same>


Iteration 3
x3 = <Rainy, Cold, High, Strong, Warm, Change> is a negative example, so FIND-S ignores it.
h3 = <Sunny, Warm, ?, Strong, Warm, Same>

Iteration 4
x4 = < Sunny, Warm, High, Strong, Cool, Change >

Step 3

Output h4 = <Sunny, Warm, ?, Strong, ?, ?>
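The whole procedure fits in a few lines of Python; the following is a minimal sketch, assuming examples are given as attribute tuples with Yes/No labels (the ∅ constraint is written as the string '0', and all names are illustrative, not part of the lecture):

def generalize(h, x):
    # minimally generalize hypothesis h so that it covers the positive example x
    return tuple(hi if hi == xi else ('?' if hi != '0' else xi)
                 for hi, xi in zip(h, x))

def find_s(examples):
    # examples: list of (attribute_tuple, label) pairs, label is 'Yes' or 'No'
    h = ('0',) * len(examples[0][0])          # h0: the most specific hypothesis
    for x, label in examples:
        if label == 'Yes':                    # negative examples are ignored
            h = generalize(h, x)
    return h

training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(find_s(training))    # ('Sunny', 'Warm', '?', 'Strong', '?', '?')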


Unanswered Questions by FIND-S

Has the learner converged to the correct target


concept?

Why prefer the most specific hypothesis?

What if the training examples are inconsistent (contain errors)?


Version Space

The set of all hypotheses consistent with the training examples is called the version space (VS) with respect to the hypothesis space H and the given example set D.
Candidate-Elimination Algorithm

The Candidate-Elimination algorithm finds all describable hypotheses that are consistent with the observed training examples.

The hypotheses are derived from the examples regardless of whether each example x is positive or negative.
Candidate-Elimination Algorithm

[Slide: definition used by the algorithm, contrasted with the earlier (i.e., FIND-S) definition]
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space
[Diagram: the examples D and the hypothesis space H determine the version space VS_H,D, shown as the subset of H consistent with D]
LIST-THEN-ELIMINATE Algorithm
to Obtain Version Space

In principle, the LIST-THEN-ELIMINATE algorithm can be


applied whenever the hypothesis space H is finite.

It is guaranteed to output all hypotheses consistent with the


training data.

Unfortunately, it requires exhaustively enumerating all hypotheses in H, an unrealistic requirement for all but the most trivial hypothesis spaces.
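A minimal sketch of LIST-THEN-ELIMINATE in Python, assuming the finite hypothesis space is handed over explicitly as a list of conjunctive hypotheses (names are illustrative):

def consistent(h, example):
    # a conjunctive hypothesis h classifies x positive iff every constraint matches
    x, label = example
    predicted = all(c == '?' or c == v for c, v in zip(h, x))
    return predicted == (label == 'Yes')

def list_then_eliminate(hypothesis_space, examples):
    version_space = list(hypothesis_space)        # 1. start with every hypothesis in H
    for example in examples:                      # 2. eliminate inconsistent hypotheses
        version_space = [h for h in version_space if consistent(h, example)]
    return version_space                          # 3. output the version space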
Candidate-Elimination Algorithm

The CANDIDATE-ELIMINATION algorithm works on the same


principle as the above LIST-THEN-ELIMINATE algorithm.

It employs a much more compact representation of the version


space.

Here the version space is represented by its most general and its least general (most specific) members.

These members form general and specific boundary sets that delimit
the version space within the partially ordered hypothesis space.
[Diagram: the version space delimited by the least general (specific) boundary S and the most general boundary G]
Candidate-Elimination Algorithm
Example

G0 {<?, ?, ?, ?, ?, ?>}

Initialization

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}
G0 {<?, ?, ?, ?, ?, ?>}

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>


Iteration 1
G1 {<?, ?, ?, ?, ?, ?>}

S1 {< Sunny, Warm, Normal, Strong, Warm, Same >}

x2 = <Sunny, Warm, High, Strong, Warm, Same>


Iteration 2
G2 {<?, ?, ?, ?, ?, ?>}

S2 {< Sunny, Warm, ?, Strong, Warm, Same >}


G2 {<?, ?, ?, ?, ?, ?>}

S2 {< Sunny, Warm, ?, Strong, Warm, Same >}

consistent

x3 = <Rainy, Cold, High, Strong, Warm, Change>


Iteration 3
S3 {< Sunny, Warm, ?, Strong, Warm, Same >}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}

G2 {<?, ?, ?, ?, ?, ?>}
S3 {< Sunny, Warm, ?, Strong, Warm, Same >}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}

x4 = <Sunny, Warm, High, Strong, Cool, Change>


Iteration 4
S4 {< Sunny, Warm, ?, Strong, ?, ? >}

G4 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>}

G3 {<Sunny, ?, ?, ?, ?, ?>, <?, Warm, ?, ?, ?, ?>, <?, ?, ?, ?, ?, Same>}
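The S/G trace above can be reproduced with a compact Python sketch of the boundary-set updates for conjunctive hypotheses; this is an illustrative simplification (it does not, for instance, prune non-minimal members of S), the attribute domains are restricted to the values appearing in the examples, and all names are my own:

def matches(h, x):
    # a conjunctive hypothesis h covers instance x iff every constraint matches
    return all(c == '?' or c == v for c, v in zip(h, x))

def more_general(h1, h2):
    # h1 is more general than or equal to h2 ('0' is the empty constraint)
    return all(a == '?' or b == '0' or a == b for a, b in zip(h1, h2))

def generalize(s, x):
    # unique minimal generalization of s that covers the positive example x
    return tuple(si if si == xi else ('?' if si != '0' else xi)
                 for si, xi in zip(s, x))

def specializations(g, domains, x):
    # minimal specializations of g that exclude the negative example x
    out = []
    for i, c in enumerate(g):
        if c == '?':
            out += [g[:i] + (v,) + g[i + 1:] for v in domains[i] if v != x[i]]
    return out

def candidate_elimination(examples, domains):
    n = len(domains)
    S, G = {('0',) * n}, {('?',) * n}
    for x, label in examples:
        if label == 'Yes':
            G = {g for g in G if matches(g, x)}
            S = {generalize(s, x) for s in S}
            S = {s for s in S if any(more_general(g, s) for g in G)}
        else:
            S = {s for s in S if not matches(s, x)}
            new_G = set()
            for g in G:
                if not matches(g, x):
                    new_G.add(g)
                else:
                    new_G |= {h for h in specializations(g, domains, x)
                              if any(more_general(h, s) for s in S)}
            # keep only the maximally general members of G
            G = {g for g in new_G
                 if not any(h != g and more_general(h, g) for h in new_G)}
    return S, G

domains = [('Sunny', 'Rainy'), ('Warm', 'Cold'), ('Normal', 'High'),
           ('Strong', 'Weak'), ('Warm', 'Cool'), ('Same', 'Change')]
training = [
    (('Sunny', 'Warm', 'Normal', 'Strong', 'Warm', 'Same'),   'Yes'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Warm', 'Same'),   'Yes'),
    (('Rainy', 'Cold', 'High',   'Strong', 'Warm', 'Change'), 'No'),
    (('Sunny', 'Warm', 'High',   'Strong', 'Cool', 'Change'), 'Yes'),
]
print(candidate_elimination(training, domains))
# S4 = {('Sunny', 'Warm', '?', 'Strong', '?', '?')}
# G4 = {('Sunny', '?', '?', '?', '?', '?'), ('?', 'Warm', '?', '?', '?', '?')}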


Remarks on Version Spaces and
Candidate-Elimination

The version space learned by the CANDIDATE-ELIMINATION algorithm


will converge toward the hypothesis that correctly describes the target
concept, provided

(1) there are no errors in the training examples, and

(2) there is some hypothesis in H that correctly


describes the target concept.
What Will Happen if the Training Data Contains Errors?

Suppose the second example, x2, is mistakenly labeled No.
G0 {<?, ?, ?, ?, ?, ?>}

S0 {<∅, ∅, ∅, ∅, ∅, ∅>}

x1 = <Sunny, Warm, Normal, Strong, Warm, Same>


Iteration 1
G1 {<?, ?, ?, ?, ?, ?>}

S1 {< Sunny, Warm, Normal, Strong, Warm, Same >}

x2 = <Sunny, Warm, High, Strong, Warm, Same>


Iteration 2
G2 {<?, ?, Normal, ?, ?, ?>}

S2 {< Sunny, Warm, Normal, Strong, Warm, Same >}


G2 {<?, ?, Normal, ?, ?, ?>}

S2 {< Sunny, Warm, Normal, Strong, Warm, Same >}

consistent

x3 = <Rainy, Cold, High, Strong, Warm, Change>


Iteration 3
S3 {< Sunny, Warm, Normal, Strong, Warm, Same >}

G3 {<?, ?, Normal, ?, ?, ?>}


S3 {< Sunny, Warm, Normal, Strong, Warm, Same >}

G3 {<?, ?, Normal, ?, ?, ?>}

x4 = <Sunny, Warm, High, Strong, Cool, Change>


Iteration 4
S4 { }  (empty)
G4 { }  (empty)

G3 {<?, ?, Normal, ?, ?, ?>}

With the erroneous example, S and G collapse to empty sets: the correct target concept has been eliminated from the version space.


What Will Happen if the Target Concept Is Not Present in the Hypothesis Space?
Remarks on Version Spaces and
Candidate-Elimination

The target concept is exactly learned when


the S and G boundary sets converge to a
single, identical, hypothesis.
Remarks on Version Spaces and
Candidate-Elimination

How Can Partially Learned Concepts Be Used?


Suppose that no additional training examples are available beyond the four in our example, and the learner is now required to classify new instances that it has not yet observed.
Remarks on Version Spaces and
Candidate-Elimination

If a new instance satisfies all six hypotheses in the version space, it can be classified Yes with full confidence.

If a new instance satisfies three of the hypotheses and fails the other three, it cannot be classified reliably; if it satisfies two and fails four, at best a majority-vote classification (No) can be given.
Decision Trees

Decision tree learning is a method for approximating discrete-valued target functions, in which the learned function is represented by a decision tree.

Decision trees can also be represented as if-then-else rules.

Decision tree learning is one of the most widely used approaches to inductive inference.
Decision Trees

An instance is classified by starting at the root node of the tree, testing the
attribute specified by this node, then moving down the tree branch
corresponding to the value of the attribute in the given example. This process
is then repeated for the subtree rooted at the new node.
Decision Trees

<Outlook = Sunny, Temperature = Hot, Humidity = High, Wind = Strong>

PlayTennis = No
Decision Trees
Intermediate nodes test attributes, edges carry attribute values, and leaf nodes hold output values.

[Diagram: a generic tree; the root tests attribute A1, whose value-labelled edges lead either to an output value or to further attribute tests A2 and A3, whose edges in turn end in output values]
Decision Trees

Decision trees represent a disjunction of conjunctions of constraints on the attribute values of instances: each path from the tree root to a leaf corresponds to a conjunction of attribute tests, and the tree itself to a disjunction of these conjunctions.
Decision Trees
Decision Trees (F = A ^ B')
If-then-else form:
If (A = True and B = False) then Yes
else No

[Tree: the root tests A; A = False gives No; A = True leads to a test of B, where B = False gives Yes and B = True gives No]
Decision Trees (F = A V (B ^ C))

If-then-else form:
If (A = True) then Yes
else if (B = True and C = True) then Yes
else No

[Tree: the root tests A; A = True gives Yes; A = False leads to a test of B; B = False gives No; B = True leads to a test of C, where C = False gives No and C = True gives Yes]
Decision Trees (F = A XOR B)
F = (A ^ B') V (A' ^ B)

If-then-else form:
If (A = True and B = False) then Yes
else if (A = False and B = True) then Yes
else No

[Tree: the root tests A; each branch then tests B; the leaves are No (A = False, B = False), Yes (A = False, B = True), Yes (A = True, B = False), and No (A = True, B = True)]
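The same XOR tree, written out directly in if-then-else code (a small illustrative sketch; the function name is mine):

def xor_tree(a, b):
    # decision tree for F = (A ^ B') V (A' ^ B)
    if a:                                   # root node tests A
        return "No" if b else "Yes"         # right subtree tests B
    else:
        return "Yes" if b else "No"         # left subtree tests B

# truth-table check
for a in (False, True):
    for b in (False, True):
        print(a, b, xor_tree(a, b))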
Decision Trees as If-then-else Rules

Each rule body is a conjunction of attribute tests, and the rule set as a whole is a disjunction:

If (Outlook = Sunny AND Humidity = Normal) then PlayTennis = Yes
If (Outlook = Overcast) then PlayTennis = Yes
If (Outlook = Rain AND Wind = Weak) then PlayTennis = Yes
Problems Suitable for Decision Trees

Instances are represented by attribute-value pairs


Instances are described by a fixed set of attributes (e.g., Temperature) and
their values (e.g., Hot). The easiest situation for decision tree learning is when
each attribute takes on a small number of disjoint possible values (e.g., Hot,
Mild, Cold). However, extensions to the basic algorithm allow handling real-
valued attributes as well (e.g., representing Temperature numerically).

The target function has discrete output values

Disjunctive descriptions may be required

The training data may contain errors

The training data may contain missing attribute values


Basic Decision Tree Learning Algorithm

ID3 Algorithm (Quinlan 1986) and its successors C4.5 and C5.0 (http://www.rulequest.com/Personal/).

Employs a top-down, greedy search through the space of possible decision trees. The algorithm never backtracks to reconsider earlier choices.
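A compact sketch of the ID3 recursion in Python, assuming training examples are given as dictionaries; it uses the entropy and information-gain measures introduced later in the lecture, the names are illustrative, and the refinements added by C4.5/C5.0 are omitted:

import math
from collections import Counter

def entropy(labels):
    # entropy (in bits) of a list of class labels
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def gain(examples, attribute, target):
    # expected reduction in entropy from partitioning examples on attribute
    before = entropy([e[target] for e in examples])
    after = 0.0
    for v in {e[attribute] for e in examples}:
        subset = [e for e in examples if e[attribute] == v]
        after += len(subset) / len(examples) * entropy([e[target] for e in subset])
    return before - after

def id3(examples, attributes, target):
    labels = [e[target] for e in examples]
    if len(set(labels)) == 1:                    # pure node: return the single label
        return labels[0]
    if not attributes:                           # no attributes left: majority label
        return Counter(labels).most_common(1)[0][0]
    best = max(attributes, key=lambda a: gain(examples, a, target))
    tree = {best: {}}
    for v in {e[best] for e in examples}:        # one branch per observed value
        subset = [e for e in examples if e[best] == v]
        remaining = [a for a in attributes if a != best]
        tree[best][v] = id3(subset, remaining, target)
    return tree

# e.g. id3(data, ['Outlook', 'Temperature', 'Humidity', 'Wind'], 'PlayTennis')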
ID3 Algorithm
Example
Attributes

Attributes are Outlook, Temperature, Humidity, Wind


Building Decision Tree
[Diagram: a generic decision tree; the root tests attribute A1, its value-labelled branches lead either to an output value or to further attribute tests A2 and A3, whose branches end in output values]
Building Decision Tree

Which attribute (Outlook, Temperature, Humidity, or Wind) should be selected for the root node?
Which Attribute to Select?

We would like to select the attribute that is most useful for


classifying examples.

What is a good quantitative measure of the worth of an


attribute?

ID3 uses a measure called information gain to select among the candidate attributes at each step while growing the tree.
Information Gain

Information gain is based on an information-theory concept called entropy.

"Nothing in life is certain except death, taxes and the second law of thermodynamics. All three are processes in which useful or accessible forms of some quantity, such as energy or money, are transformed into useless, inaccessible forms of the same quantity. That is not to say that these three processes don't have fringe benefits: taxes pay for roads and schools; the second law of thermodynamics drives cars, computers and metabolism; and death, at the very least, opens up tenured faculty positions."
(Seth Lloyd, writing in Nature 430, 971, 26 August 2004)

Rudolf Julius Emanuel Clausius (January 2, 1822 - August 24, 1888) was a German physicist and mathematician and is considered one of the central founders of the science of thermodynamics.

Claude Elwood Shannon (April 30, 1916 - February 24, 2001), an American electrical engineer and mathematician, has been called "the father of information theory".
Entropy

In information theory, the Shannon entropy or


information entropy is a measure of the uncertainty
associated with a random variable.

It quantifies the information contained in a


message, usually in bits or bits/symbol.

It is the minimum message length necessary to


communicate information.
Why Shannon named his uncertainty
function "entropy"?

John von
Neumann

My greatest concern was what to call it. I thought of calling it 'information,' but the
word was overly used, so I decided to call it 'uncertainty.' When I discussed it with
John von Neumann, he had a better idea. Von Neumann told me, 'You should call
it entropy, for two reasons. In the first place your uncertainty function has
been used in statistical mechanics under that name, so it already has a name.
In the second place, and more important, no one really knows what entropy
really is, so in a debate you will always have the advantage.'
Shannon's mouse

Shannon and his famous electromechanical mouse Theseus, named after the Greek hero of Minotaur and Labyrinth fame, which he tried to teach to find its way out of a maze in one of the first experiments in artificial intelligence.
Entropy

The information entropy of a discrete random variable X that can take on possible values {x1, ..., xn} is

H(X) = E[I(X)] = - Σi p(xi) log2 p(xi)

where
I(X) is the information content or self-information of X, which is itself a random variable; and
p(xi) = Pr(X = xi) is the probability mass function of X.
Entropy in our Context

Given a collection S containing positive and negative examples of some target concept, the entropy of S relative to this boolean classification (yes/no) is

Entropy(S) = - p+ log2 p+ - p- log2 p-

where p+ is the proportion of positive examples in S and p- is the proportion of negative examples in S. In all calculations involving entropy we define 0 log 0 to be 0.
Example

There are 14 examples: 9 positive and 5 negative [9+, 5-].

The entropy of S relative to this boolean (yes/no) classification is

Entropy(S) = -(9/14) log2(9/14) - (5/14) log2(5/14) = 0.940
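This value can be checked with a small Python helper (an illustrative sketch; the function name is mine):

import math

def entropy(pos, neg):
    # entropy of a boolean-labelled sample with pos positive and neg negative examples
    total = pos + neg
    result = 0.0
    for count in (pos, neg):
        p = count / total
        if p > 0:               # 0 log 0 is taken to be 0
            result -= p * math.log2(p)
    return result

print(entropy(9, 5))            # ~0.940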


Information Gain Measure

Information gain is simply the expected reduction in entropy caused by partitioning the examples according to a given attribute.

More precisely, the information gain Gain(S, A) of an attribute A, relative to a collection of examples S, is defined as

Gain(S, A) = Entropy(S) - Σ_{v ∈ Values(A)} (|Sv| / |S|) Entropy(Sv)

where Values(A) is the set of all possible values for attribute A, and Sv is the subset of S for which attribute A has value v (i.e., Sv = {s ∈ S | A(s) = v}).
Information Gain Measure

In the formula above, the first term is the entropy of S and the second term is the expected entropy after S is partitioned using attribute A.

Gain(S, A) is the expected reduction in entropy caused by knowing the value of


attribute A.

Gain(S, A) is the information provided about the target function value, given the value of some other attribute A. The value of Gain(S, A) is the number of bits saved when encoding the target value of an arbitrary member of S, by knowing the value of attribute A.
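This definition can be sketched directly in Python, assuming the per-value positive/negative counts of S are given as a dictionary (the function names are illustrative, not part of the lecture):

import math

def entropy2(pos, neg):
    # entropy of a collection with pos positive and neg negative examples (0 log 0 = 0)
    total = pos + neg
    return -sum(c / total * math.log2(c / total) for c in (pos, neg) if c)

def information_gain(counts):
    # counts maps each value v of attribute A to the (positive, negative) counts of Sv
    pos = sum(p for p, n in counts.values())
    neg = sum(n for p, n in counts.values())
    total = pos + neg
    expected = sum((p + n) / total * entropy2(p, n) for p, n in counts.values())
    return entropy2(pos, neg) - expected

# e.g. a split whose subsets contain {1+, 2-} and {1+, 1-} examples
print(information_gain({'Weak': (1, 2), 'Strong': (1, 1)}))   # ~0.020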
Example

There are 14 examples: 9 positive and 5 negative [9+, 5-], so Entropy(S) = 0.940, as computed above.


Gain (S, Attribute = Wind)
Gain (SSunny, A)

Temperature: (Hot) {0+, 2-}, (Mild) {1+, 1-}, (Cool) {1+, 0-}
Humidity: (High) {0+, 3-}, (Normal) {2+, 0-}
Wind: (Weak) {1+, 2-}, (Strong) {1+, 1-}
Gain (SSunny, A)

Entropy(SSunny) = - { 2/5 log(2/5) + 3/5 log(3/5) } = 0.97095

Temperature: (Hot) {0+, 2-}, (Mild) {1+, 1-}, (Cool) {1+, 0-}
Entropy(Hot) = 0
Entropy(Mild) = 1
Entropy(Cool) = 0
Gain(SSunny, Temperature) = 0.97095 - 2/5*0 - 2/5*1 - 1/5*0 = 0.57095

Humidity: (High) {0+, 3-}, (Normal) {2+, 0-}
Entropy(High) = 0
Entropy(Normal) = 0
Gain(SSunny, Humidity) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095

Wind: (Weak) {1+, 2-}, (Strong) {1+, 1-}
Entropy(Weak) = 0.9183
Entropy(Strong) = 1.0
Gain(SSunny, Wind) = 0.97095 - 3/5*0.9183 - 2/5*1 = 0.01997
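A quick numeric check of these three gains from the counts listed above (a minimal, self-contained sketch; H is an illustrative helper):

import math

def H(p, n):
    # entropy of a {p+, n-} collection (0 log 0 taken as 0)
    t = p + n
    return -sum(c / t * math.log2(c / t) for c in (p, n) if c)

s = H(2, 3)                                          # Entropy(SSunny) = 0.97095
print(s - 2/5*H(0, 2) - 2/5*H(1, 1) - 1/5*H(1, 0))   # Temperature: ~0.571
print(s - 3/5*H(0, 3) - 2/5*H(2, 0))                 # Humidity:    ~0.971
print(s - 3/5*H(1, 2) - 2/5*H(1, 1))                 # Wind:        ~0.020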
Modified Decision Tree
Gain (SRain, A)

Temperature: (Hot) {0+, 0-}, (Mild) {2+, 1-}, (Cool) {1+, 1-}
Humidity: (High) {1+, 1-}, (Normal) {2+, 1-}
Wind: (Weak) {3+, 0-}, (Strong) {0+, 2-}
Gain (SRain, A)

Entropy(SRain) = - { 3/5 log(3/5) + 2/5 log(2/5) } = 0.97095

Temperature: (Hot) {0+, 0-}, (Mild) {2+, 1-}, (Cool) {1+, 1-}
Entropy(Hot) = 0
Entropy(Mild) = 0.9183
Entropy(Cool) = 1.0
Gain(SRain, Temperature) = 0.97095 - 0 - 3/5*0.9183 - 2/5*1.0 = 0.01997

Humidity: (High) {1+, 1-}, (Normal) {2+, 1-}
Entropy(High) = 1.0
Entropy(Normal) = 0.9183
Gain(SRain, Humidity) = 0.97095 - 2/5*1.0 - 3/5*0.9183 = 0.01997

Wind: (Weak) {3+, 0-}, (Strong) {0+, 2-}
Entropy(Weak) = 0.0
Entropy(Strong) = 0.0
Gain(SRain, Wind) = 0.97095 - 3/5*0 - 2/5*0 = 0.97095
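The corresponding check for SRain, again as a self-contained sketch (the helper returns 0 for the empty Hot partition):

import math

def H(p, n):
    # entropy of a {p+, n-} collection; an empty partition contributes 0
    t = p + n
    return -sum(c / t * math.log2(c / t) for c in (p, n) if c) if t else 0.0

s = H(3, 2)                                          # Entropy(SRain) = 0.97095
print(s - 0/5*H(0, 0) - 3/5*H(2, 1) - 2/5*H(1, 1))   # Temperature: ~0.020
print(s - 2/5*H(1, 1) - 3/5*H(2, 1))                 # Humidity:    ~0.020
print(s - 3/5*H(3, 0) - 2/5*H(0, 2))                 # Wind:        ~0.971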
Final Decision Tree
Homework
S {3+, 3-} => Entropy(S) = 1

a1: (True) {2+, 1-}, (False) {1+, 2-}
Entropy(a1=True) = -{ 2/3 log(2/3) + 1/3 log(1/3) } = 0.9183
Entropy(a1=False) = 0.9183
Gain(S, a1) = 1 - 3/6*0.9183 - 3/6*0.9183 = 0.0817

a2: (True) {2+, 2-}, (False) {1+, 1-}
Entropy(a2=True) = 1.0
Entropy(a2=False) = 1.0
Gain(S, a2) = 1 - 4/6*1 - 2/6*1 = 0.0
Homework

[Tree after the first split: the root tests a1; a1 = True holds examples [D1, D2, D3] and a1 = False holds [D4, D5, D6]]

Homework

[Tree after the second split: each a1 branch tests a2; for a1 = True, a2 = True gives + (Yes) and a2 = False gives - (No); for a1 = False, a2 = True gives - (No) and a2 = False gives + (Yes)]
Homework

[Final tree, as above; the learned concept is:]

(a1^a2) V (a1' ^ a2')


Some Insights into Capabilities and
Limitations of ID3 Algorithm
ID3 searches a complete hypothesis space: any finite discrete-valued function can be represented by some decision tree. [Advantage]

ID3 maintains only a single current hypothesis as it searches through the space of decision trees. By committing to a single hypothesis, ID3 loses the capabilities that follow from explicitly representing all consistent hypotheses. [Disadvantage]

ID3 in its pure form performs no backtracking in its search. Once it


selects an attribute to test at a particular level in the tree, it never
backtracks to reconsider this choice. Therefore, it is susceptible to
the usual risks of hill-climbing search without backtracking:
converging to locally optimal solutions that are not globally optimal.
[Disadvantage]
Some Insights into Capabilities and
Limitations of ID3 Algorithm

ID3 uses all training examples at each step in the search to make
statistically based decisions regarding how to refine its current
hypothesis. This contrasts with methods that make decisions
incrementally, based on individual training examples (e.g., FIND-S
or CANDIDATE-ELIMINATION). One advantage of using statistical
properties of all the examples (e.g., information gain) is that the
resulting search is much less sensitive to errors in individual training
examples. [Advantage]
